Learning with Square Loss: Localization through Offset Rademacher Complexity

Tengyuan Liang, Alexander Rakhlin, Karthik Sridharan


Abstract 

We consider regression with square loss and general classes of functions without the boundedness assumption. We introduce a notion of offset Rademacher complexity that provides a transparent way to study localization both in expectation and in high probability. For any (possibly non-convex) class, the excess loss of a two-step estimator is shown to be upper bounded by this offset complexity through a novel geometric inequality. In the convex case, the estimator reduces to an empirical risk minimizer. The method recovers the results of [18] for the bounded case while also providing guarantees without the boundedness assumption.


1 Introduction 

Determining the finite-sample behavior of risk in the problem of regression is arguably one of the most basic problems of Learning Theory and Statistics. This behavior can be studied in substantial generality with the tools of empirical process theory. When functions in a given convex class are uniformly bounded, one may verify the so-called "Bernstein condition." The condition, which relates the variance of the increments of the empirical process to their expectation, implies a certain localization phenomenon around the optimum and forms the basis of the analysis via local Rademacher complexities. The technique has been developed in [9, 8, 5, 2, 4], among others, based on Talagrand's celebrated concentration inequality for the supremum of an empirical process.

In a recent pathbreaking paper, [14] showed that a large part of this heavy machinery is not necessary for obtaining tight upper bounds on excess loss, even (and especially) if functions are unbounded. Mendelson observed that only one-sided control of the tail is required in the deviation inequality, and, thankfully, it is the tail that can be controlled under very mild assumptions.

In a parallel line of work, the search within the online learning setting for an analogue of "localization" has led to a notion of an "offset" Rademacher process [17], yielding, in a rather clean manner, optimal rates for minimax regret in online supervised learning. It was also shown that the supremum of the offset process is a lower bound on the minimax value, thus establishing its intrinsic nature. The present paper blends the ideas of [14] and [17]. We introduce the notion of an offset Rademacher process for i.i.d. data and show that the supremum of this process upper bounds (both in expectation and in high probability) the excess risk of an empirical risk minimizer (for convex classes) and a two-step Star estimator of [1] (for arbitrary classes). The statement holds under a weak assumption even if functions are not uniformly bounded.

The offset Rademacher complexity provides an intuitive alternative to the machinery of local Rademacher averages. Let us recall that the Rademacher process indexed by a function class $\mathcal{G} \subseteq \mathbb{R}^{\mathcal{X}}$ is defined as a stochastic process $g \mapsto \frac{1}{n}\sum_{t=1}^n \epsilon_t g(x_t)$, where $x_1,\ldots,x_n \in \mathcal{X}$ are held fixed and $\epsilon_1,\ldots,\epsilon_n$ are i.i.d. Rademacher random variables. We define the offset Rademacher process as a stochastic process

$$g \mapsto \frac{1}{n}\sum_{t=1}^n \epsilon_t g(x_t) - c\, g(x_t)^2$$

for some $c > 0$. The process itself captures the notion of localization: when $g$ is large in magnitude, the negative quadratic term acts as a compensator and "extinguishes" the fluctuations of the term involving Rademacher variables. The supremum of the process will be termed offset Rademacher complexity, and one may expect that this complexity is of a smaller order than the classical Rademacher averages (which, without localization, cannot be better than the rate of $n^{-1/2}$).

The self-modulating property of the offset complexity can be illustrated on the canonical example of a linear class $\mathcal{G} = \{x \mapsto w^\top x : w \in \mathbb{R}^p\}$, in which case the offset Rademacher complexity becomes

$$\frac{1}{n}\sup_{w\in\mathbb{R}^p}\left\{ w^\top\Big(\sum_{t=1}^n \epsilon_t x_t\Big) - c\,\|w\|_{\Sigma}^2 \right\} = \frac{1}{4cn}\Big\|\sum_{t=1}^n \epsilon_t x_t\Big\|_{\Sigma^{-1}}^2,$$

where $\Sigma = \sum_{t=1}^n x_t x_t^\top$ and $\|w\|_\Sigma^2 = w^\top \Sigma w$. Under mild conditions, the above expression is of the order $O(p/n)$ in expectation and in high probability, a familiar rate achieved by the ordinary least squares, at least in the case of a well-specified model. We refer to Section 6 for the precise statement for both well-specified and misspecified case.
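As a rough numerical illustration of the linear example, the following sketch treats the one-dimensional case $p = 1$, where the supremum over $w$ has the closed form $s^2/(4cn\Sigma)$ with $s = \sum_t \epsilon_t x_t$ and $\Sigma = \sum_t x_t^2$. The sample size, the constant `c`, the Gaussian design, and all function names are assumptions chosen for illustration; none of them come from the paper.

```python
import itertools
import random

def offset_value(w, xs, eps, c):
    """Offset Rademacher process at w: (1/n) sum_t [eps_t*w*x_t - c*(w*x_t)^2]."""
    n = len(xs)
    return sum(e * w * x - c * (w * x) ** 2 for e, x in zip(eps, xs)) / n

def offset_sup_closed_form(xs, eps, c):
    """Supremum over w in R, obtained by maximizing the concave quadratic in w."""
    n = len(xs)
    s = sum(e * x for e, x in zip(eps, xs))   # s = sum_t eps_t x_t
    sigma = sum(x * x for x in xs)            # Sigma = sum_t x_t^2
    return s * s / (4.0 * c * n * sigma)

rng = random.Random(0)
n, c = 50, 0.5                                # illustrative choices
xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
eps = [rng.choice((-1, 1)) for _ in range(n)]

closed = offset_sup_closed_form(xs, eps, c)
# Brute-force check: a fine grid over w approaches the closed form from below.
grid_best = max(offset_value(k / 1000.0, xs, eps, c) for k in range(-5000, 5001))

# Exact average over all sign patterns for a small sample: since E s^2 = Sigma,
# the expected supremum equals 1/(4*c*m), i.e. p/(4cn) with p = 1.
m = 10
xs_small = xs[:m]
mean_sup = sum(offset_sup_closed_form(xs_small, e, c)
               for e in itertools.product((-1, 1), repeat=m)) / 2 ** m
```

In this toy setting the plain Rademacher average over all $w \in \mathbb{R}$ would be infinite, while the quadratic offset keeps the supremum finite and of order $1/n$.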

Our contributions can be summarized as follows. First, we show that offset Rademacher complexity is an upper bound on excess loss of the proposed estimator, both in expectation and in deviation. We then extend the chaining technique to quantify the behavior of the supremum of the offset process in terms of covering numbers. By doing so, we recover the rates of aggregation established in [18] and, unlike the latter paper, the present method does not require boundedness (of the noise and functions). We provide a lower bound on minimax excess loss in terms of offset Rademacher complexity, indicating its intrinsic nature for the problems of regression. While our in-expectation results for bounded functions do not require any assumptions, the high probability statements rest on a lower isometry assumption that holds, for instance, for subgaussian classes. We show that offset Rademacher complexity can be further upper bounded by the fixed-point complexities defined by Mendelson [14]. We conclude with the analysis of ordinary least squares.


2 Problem Description and the Estimator 

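Let $\mathcal{F}$ be a class of functions on a probability space $(\mathcal{X}, P_X)$. The response is given by an unknown random variable $Y$, distributed jointly with $X$ according to $P = P_X \times P_{Y|X}$. We observe a sample $(X_1, Y_1),\ldots,(X_n, Y_n)$ distributed i.i.d. according to $P$ and aim to construct an estimator $\hat{f}$ with small excess loss $\mathcal{E}(\hat{f})$, where

$$\mathcal{E}(g) = \mathbb{E}(g - Y)^2 - \inf_{f\in\mathcal{F}} \mathbb{E}(f - Y)^2 \qquad (1)$$

and $\mathbb{E}(f - Y)^2 = \mathbb{E}(f(X) - Y)^2$ is the expectation with respect to $(X, Y)$. Let $\hat{\mathbb{E}}$ denote the empirical expectation operator and define the following two-step procedure:

$$\hat{g} = \operatorname*{argmin}_{f\in\mathcal{F}} \hat{\mathbb{E}}(f(X) - Y)^2, \qquad \hat{f} = \operatorname*{argmin}_{f\in\mathrm{star}(\mathcal{F},\hat{g})} \hat{\mathbb{E}}(f(X) - Y)^2 \qquad (2)$$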

where $\mathrm{star}(\mathcal{F}, g) = \{\lambda g + (1-\lambda) f : f \in \mathcal{F}, \lambda \in [0,1]\}$ is the star hull of $\mathcal{F}$ around $g$. (We abbreviate $\mathrm{star}(\mathcal{F}, 0)$ as $\mathrm{star}(\mathcal{F})$.) This two-step estimator was introduced (to the best of our knowledge) by [1] for a finite class $\mathcal{F}$. We will refer to the procedure as the Star estimator. Audibert showed that this method is deviation-optimal for finite aggregation. This was the first such result, followed by other estimators with similar properties [10, 6] for the finite case. We present analysis that quantifies the behavior of this method for arbitrary classes of functions. The method has several nice features. First, it provides an alternative to the 3-stage discretization method of [18], does not require the prior knowledge of the entropy of the class, and goes beyond the bounded case. Second, it enjoys an upper bound of offset Rademacher complexity via relatively routine arguments under rather weak assumptions. Third, it naturally reduces to empirical risk minimization for convex classes (indeed, this happens whenever $\mathrm{star}(\mathcal{F}, \hat{g}) = \mathcal{F}$).
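The two-step procedure in (2) can be sketched for a finite class as follows. This is a minimal illustration, not the authors' implementation: each function is identified with its vector of values on the sample, the star hull around the first-step minimizer is a union of line segments, and the second step minimizes the empirical loss exactly on each segment. The helper names (`emp_loss`, `star_estimator`) and the random data are assumptions made for the example.

```python
import random

def emp_loss(f, y):
    """Empirical square loss (1/n) sum_i (f_i - y_i)^2."""
    return sum((a - b) ** 2 for a, b in zip(f, y)) / len(y)

def star_estimator(F, y):
    """F: list of vectors (f(X_1), ..., f(X_n)). Returns (g_hat, f_hat)."""
    g = min(F, key=lambda f: emp_loss(f, y))       # step 1: ERM over the class
    best, best_loss = g, emp_loss(g, y)
    for f in F:                                    # step 2: ERM over star(F, g)
        d = [a - b for a, b in zip(f, g)]          # direction f - g
        dd = sum(v * v for v in d)
        if dd == 0.0:
            continue
        # Minimize ||g + t*d - y||^2 over t in [0, 1], where t = 1 - lambda.
        t = sum(v * (yi - gi) for v, gi, yi in zip(d, g, y)) / dd
        t = min(1.0, max(0.0, t))
        cand = [gi + t * v for gi, v in zip(g, d)]
        loss = emp_loss(cand, y)
        if loss < best_loss:
            best, best_loss = cand, loss
    return g, best

rng = random.Random(0)
n = 20
y = [rng.gauss(0.0, 1.0) for _ in range(n)]
F = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(5)]
g_hat, f_hat = star_estimator(F, y)
```

When the class is convex, every segment already lies in the class, so the second step cannot improve on the first and the procedure coincides with empirical risk minimization.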

Let $f^*$ denote the minimizer

$$f^* = \operatorname*{argmin}_{f\in\mathcal{F}} \mathbb{E}(f(X) - Y)^2,$$

and let $\xi$ denote the "noise"

$$\xi = Y - f^*.$$

We say that the model is misspecified if the regression function $\mathbb{E}[Y \mid X = x] \notin \mathcal{F}$, which means $\xi$ is not zero-mean. Otherwise, we say that the model is well-specified.


3 A Geometric Inequality 

We start by proving a geometric inequality for the Star estimator. This deterministic inequality holds conditionally on $X_1,\ldots,X_n$, and therefore reduces to a problem in $\mathbb{R}^n$.

Lemma 1 (Geometric Inequality). The two-step estimator $\hat{f}$ in (2) satisfies

$$\hat{\mathbb{E}}(h - Y)^2 - \hat{\mathbb{E}}(\hat{f} - Y)^2 \geq c \cdot \hat{\mathbb{E}}(\hat{f} - h)^2 \qquad (3)$$

for any $h \in \mathcal{F}$ and $c = 1/18$. If $\mathcal{F}$ is convex, (3) holds with $c = 1$. Moreover, if $\mathcal{F}$ is a linear subspace, (3) holds with equality and $c = 1$ by the Pythagorean theorem.

Remark 1. In the absence of convexity of $\mathcal{F}$, the two-step estimator $\hat{f}$ mimics the key Pythagorean identity, though with a constant 1/18. We have not focused on optimizing $c$ but rather on presenting a clean geometric argument.



Proof of Lemma 1. Define the empirical $\ell_2$ distance to be, for any $f, g$, $\|f\|_n := [\hat{\mathbb{E}} f^2]^{1/2}$ and the empirical product to be $\langle f, g\rangle_n := \hat{\mathbb{E}}[fg]$. We will slightly abuse the notation by identifying every function with its finite-dimensional projection on $(X_1,\ldots,X_n)$.

Denote the ball (and sphere) centered at $Y$ and with radius $\|\hat{g} - Y\|_n$ to be $\mathcal{B}_1 := \mathcal{B}(Y, \|\hat{g} - Y\|_n)$ (and $\mathcal{S}_1$, correspondingly). In a similar manner, define $\mathcal{B}_2 := \mathcal{B}(Y, \|\hat{f} - Y\|_n)$ and $\mathcal{S}_2$. By the definition of the Star algorithm, we have $\mathcal{B}_2 \subseteq \mathcal{B}_1$. The statement holds with $c = 1$ if $\hat{f} = \hat{g}$, and so we may assume $\mathcal{B}_2 \subset \mathcal{B}_1$. Denote by $\mathcal{C}$ the conic hull of $\mathcal{B}_2$ with origin at $\hat{g}$. Define the spherical cap outside the cone $\mathcal{C}$ to be $\mathcal{S} = \mathcal{S}_1 \setminus \mathcal{C}$ (drawn in red in Figure 3).

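First, by the optimality of $\hat{g}$, for any $h \in \mathcal{F}$ we have $\|h - Y\|_n^2 \geq \|\hat{g} - Y\|_n^2$, i.e. any $h \in \mathcal{F}$ is not in the interior of $\mathcal{B}_1$. Furthermore, $h$ is not in the interior of the cone $\mathcal{C}$, as otherwise there would be a point inside $\mathcal{B}_2$ strictly better than $\hat{f}$. Thus $h \in (\mathrm{int}\,\mathcal{C})^c \cap (\mathrm{int}\,\mathcal{B}_1)^c$.

Second, $\hat{f} \in \mathcal{B}_2$ and it is a contact point of $\mathcal{C}$ and $\mathcal{S}_2$. Indeed, $\hat{f}$ is necessarily on a line segment between $\hat{g}$ and a point outside $\mathcal{B}_1$ that does not pass through the interior of $\mathcal{B}_2$ by optimality of $\hat{f}$. Let $K$ be the set of all contact points, that is, the potential locations of $\hat{f}$.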

Now we fix $h \in \mathcal{F}$ and consider the two-dimensional plane $\mathcal{L}$ that passes through three points $(\hat{g}, Y, h)$, depicted in Figure 3. Observe that the left-hand-side of the desired inequality (3) is constant as $\hat{f}$ ranges over $K$. To prove the inequality it therefore suffices to choose a value $f' \in K$ that maximizes the right-hand-side. The maximization of $\|h - f'\|_n^2$ over $f' \in K$ is achieved by $f' \in K \cap \mathcal{L}$. This can be argued simply by symmetry: the two-dimensional plane $\mathcal{L}$ intersects $\mathrm{span}(K)$ in a line and the distance between $h$ and $K$ is maximized at the extreme point of this intersection. Hence, to prove the desired inequality, we can restrict our attention to the plane $\mathcal{L}$ and $f'$ instead of $\hat{f}$.

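For any $h \in \mathcal{F}$, define the projection of $h$ onto the shell $\mathcal{L} \cap \mathcal{S}$ to be $h_\perp \in \mathcal{S}$. We first prove (3) for $h_\perp$ and then extend the statement to $h$. By the geometry of the cone,

$$\|f' - \hat{g}\|_n \geq \tfrac{1}{2}\|\hat{g} - h_\perp\|_n.$$

By triangle inequality,

$$\|f' - \hat{g}\|_n \geq \tfrac{1}{2}\|\hat{g} - h_\perp\|_n \geq \tfrac{1}{2}\left(\|f' - h_\perp\|_n - \|f' - \hat{g}\|_n\right).$$

Rearranging,

$$\|f' - \hat{g}\|_n^2 \geq \tfrac{1}{9}\|f' - h_\perp\|_n^2.$$

By the Pythagorean theorem,

$$\|h_\perp - Y\|_n^2 - \|f' - Y\|_n^2 = \|\hat{g} - Y\|_n^2 - \|f' - Y\|_n^2 = \|f' - \hat{g}\|_n^2 \geq \tfrac{1}{9}\|f' - h_\perp\|_n^2,$$

thus proving the claim for $h_\perp$ for constant $c = 1/9$.

We can now extend the claim to $h$. Indeed, due to the fact that $h \in (\mathrm{int}\,\mathcal{C})^c \cap (\mathrm{int}\,\mathcal{B}_1)^c$ and the geometry of the projection $h \to h_\perp$, we have $\langle h_\perp - Y, h_\perp - h\rangle_n \leq 0$. Thus

$$\|h - Y\|_n^2 - \|f' - Y\|_n^2 = \|h_\perp - h\|_n^2 + \|h_\perp - Y\|_n^2 - 2\langle h_\perp - Y, h_\perp - h\rangle_n - \|f' - Y\|_n^2$$
$$\geq \|h_\perp - h\|_n^2 + \left(\|h_\perp - Y\|_n^2 - \|f' - Y\|_n^2\right)$$
$$\geq \|h_\perp - h\|_n^2 + \tfrac{1}{9}\|f' - h_\perp\|_n^2 \geq \tfrac{1}{18}\left(\|h_\perp - h\|_n + \|f' - h_\perp\|_n\right)^2 \geq \tfrac{1}{18}\|f' - h\|_n^2,$$

where the last step is the triangle inequality. This proves the claim for $h$ with constant 1/18.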


□ 
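The geometric inequality (3) can be probed numerically. The sketch below is an illustrative check under stated assumptions, not part of the proof: it draws random finite classes (vectors of function values on a sample), computes the two-step Star estimator exactly along each segment of the star hull, and records the smallest observed ratio between the left-hand side of (3) and $\hat{\mathbb{E}}(\hat{f} - h)^2$. The helper names, the dimensions, and the Gaussian data are all hypothetical choices.

```python
import random

def sq_norm(u):
    """Squared empirical norm ||u||_n^2 = (1/n) sum_i u_i^2."""
    return sum(v * v for v in u) / len(u)

def diff(u, v):
    return [a - b for a, b in zip(u, v)]

def star_fit(F, y):
    """Two-step Star estimator for a finite class given by value vectors."""
    g = min(F, key=lambda f: sq_norm(diff(f, y)))      # first step: ERM
    best = g
    for f in F:                                        # second step: segments g -> f
        d = diff(f, g)
        dd = sum(v * v for v in d)
        if dd == 0.0:
            continue
        t = sum(v * r for v, r in zip(d, diff(y, g))) / dd
        t = min(1.0, max(0.0, t))                      # stay inside the segment
        cand = [a + t * v for a, v in zip(g, d)]
        if sq_norm(diff(cand, y)) < sq_norm(diff(best, y)):
            best = cand
    return best

rng = random.Random(1)
worst_ratio = float("inf")
for _ in range(200):
    n, N = 5, 6
    y = [rng.gauss(0.0, 1.0) for _ in range(n)]
    F = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(N)]
    f_hat = star_fit(F, y)
    for h in F:
        gap = sq_norm(diff(h, y)) - sq_norm(diff(f_hat, y))   # LHS of (3)
        dist = sq_norm(diff(f_hat, h))                        # E_hat (f_hat - h)^2
        if dist > 1e-12:
            worst_ratio = min(worst_ratio, gap / dist)
```

According to Lemma 1, `worst_ratio` should never fall below 1/18; random instances typically sit well above that constant, which is consistent with Remark 1's comment that $c$ was not optimized.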


An upper bound on excess loss follows immediately from Lemma 1. 

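Corollary 2. Conditioned on the data $\{X_i, Y_i\}_{i=1}^n$, we have a deterministic upper bound for the Star algorithm:

$$\mathcal{E}(\hat{f}) \leq (\hat{\mathbb{E}} - \mathbb{E})[2(f^* - Y)(f^* - \hat{f})] + \mathbb{E}(f^* - \hat{f})^2 - (1 + c)\cdot\hat{\mathbb{E}}(f^* - \hat{f})^2, \qquad (4)$$

with the value of constant $c$ given in Lemma 1.

Proof.

$$\mathcal{E}(\hat{f}) = \mathbb{E}(\hat{f}(X) - Y)^2 - \inf_{f\in\mathcal{F}}\mathbb{E}(f(X) - Y)^2$$
$$\leq \mathbb{E}(\hat{f} - Y)^2 - \mathbb{E}(f^* - Y)^2 + \left[\hat{\mathbb{E}}(f^* - Y)^2 - \hat{\mathbb{E}}(\hat{f} - Y)^2 - c\cdot\hat{\mathbb{E}}(\hat{f} - f^*)^2\right]$$
$$= (\hat{\mathbb{E}} - \mathbb{E})[2(f^* - Y)(f^* - \hat{f})] + \mathbb{E}(f^* - \hat{f})^2 - (1 + c)\cdot\hat{\mathbb{E}}(f^* - \hat{f})^2.$$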


□ 

An attentive reader will notice that the multiplier on the negative empirical quadratic term in (4) is slightly 
larger than the one on the expected quadratic term. This is the starting point of the analysis that follows. 




4 Symmetrization 

We will now show that the discrepancy in the multiplier constant in (4) leads to offset Rademacher complexity through rather elementary symmetrization inequalities. We perform this analysis both in expectation (for the case of bounded functions) and in high probability (for the general unbounded case). While the former result follows from the latter, the in-expectation statement for bounded functions requires no assumptions, in contrast to control of the tails.

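Theorem 3. Define the set $\mathcal{H} := \mathcal{F} - f^* + \mathrm{star}(\mathcal{F} - \mathcal{F})$. The following expectation bound on excess loss of the Star estimator holds:

$$\mathbb{E}\,\mathcal{E}(\hat{f}) \leq (2M + K(2 + c)/2)\cdot\mathbb{E}\sup_{h\in\mathcal{H}}\left\{\frac{1}{n}\sum_{i=1}^n 2\epsilon_i h(X_i) - c' h(X_i)^2\right\}$$

where $\epsilon_1,\ldots,\epsilon_n$ are independent Rademacher random variables, $c' = \min\left\{\frac{c}{4M}, \frac{c}{4K(2+c)}\right\}$, $K = \sup_f |f|_\infty$, and $M = \sup_f |Y - f|_\infty$ almost surely.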

The proof of the theorem involves an introduction of independent Rademacher random variables and two contraction-style arguments to remove the multipliers $(Y_i - f^*(X_i))$. These algebraic manipulations are postponed to the appendix.

The term in the curly brackets will be called an offset Rademacher process, and the expected supremum — an 
offset Rademacher complexity. While Theorem 3 only applies to bounded functions and bounded noise, the upper 
bound already captures the localization phenomenon, even for non-convex function classes (and thus goes well 
beyond the classical local Rademacher analysis). 

As argued in [14], it is the contraction step that requires boundedness of the functions when analyzing square 
loss. Mendelson uses a small ball assumption (a weak condition on the distribution, stated below) to split the 
analysis into the study of the multiplier and quadratic terms. This assumption allows one to compare the expected 
square of any function to its empirical version, to within a multiplicative constant that depends on the small ball 
property. In contrast, we need a somewhat stronger assumption that will allow us to take this constant to be at least $1 - c/4$. We phrase this condition, the lower isometry bound, as follows.¹

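Definition 1 (Lower Isometry Bound). We say that a function class $\mathcal{F}$ satisfies the lower isometry bound with some parameters $0 < \eta < 1$ and $0 < \delta < 1$ if

$$\mathbb{P}\left(\inf_{f\in\mathcal{F}\setminus\{0\}} \frac{1}{n}\sum_{i=1}^n \frac{f^2(X_i)}{\mathbb{E}f^2} \geq 1 - \eta\right) \geq 1 - \delta \qquad (5)$$

for all $n \geq n_0(\mathcal{F},\delta,\eta)$, where $n_0(\mathcal{F},\delta,\eta)$ depends on the complexity of the class.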
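To get a feel for the event inside (5), the following sketch estimates its probability by simulation in the simplest possible setting. It assumes, purely for illustration, the one-dimensional linear class $\{x \mapsto wx\}$ with standard Gaussian $X$, for which the ratio $\frac{1}{n}\sum_i f^2(X_i)/\mathbb{E}f^2$ does not depend on $w$, so the infimum over the class is trivial. The function names, the value of `eta`, and the sample sizes are hypothetical.

```python
import random

def isometry_ratio(sample, second_moment):
    """(1/n) sum_i f(X_i)^2 / E f^2 for f(x) = w*x; the factor w^2 cancels."""
    return sum(x * x for x in sample) / (len(sample) * second_moment)

def prob_lower_isometry(n, eta, trials, rng):
    """Monte Carlo estimate of P(ratio >= 1 - eta) for X ~ N(0, 1)."""
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if isometry_ratio(sample, 1.0) >= 1.0 - eta:   # E X^2 = 1 for N(0, 1)
            hits += 1
    return hits / trials

rng = random.Random(0)
eta = 0.25
p_small = prob_lower_isometry(20, eta, 200, rng)     # small sample
p_large = prob_lower_isometry(2000, eta, 200, rng)   # large sample
```

The estimated probability approaches one as $n$ grows, which is the behavior that the threshold $n_0$ in the definition encodes; richer classes require a genuine infimum and are not covered by this toy computation.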

In general this is a mild assumption that requires good tail behavior of functions in $\mathcal{F}$, yet it is stronger than the small ball property. Mendelson [16] shows that this condition holds for heavy-tailed classes assuming the small ball condition plus a norm-comparison property $\|f\|_{\ell_q} \leq L\|f\|_{\ell_2}$ for all $f \in \mathcal{F}$. We also remark that Assumption 1 (the lower isometry bound of Definition 1) holds for sub-gaussian classes $\mathcal{F}$ using concentration tools, as already shown in [11]. For completeness, let us also state the small ball property:

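Definition 2 (Small Ball Property [14, 15]). The class of functions $\mathcal{F}$ satisfies the small-ball condition if there exist constants $\kappa > 0$ and $0 < \epsilon < 1$ such that for every $f \in \mathcal{F}$,

$$\mathbb{P}\left(|f(X)| \geq \kappa(\mathbb{E}f^2)^{1/2}\right) \geq \epsilon.$$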

Armed with the lower isometry bound, we now prove that the tail behavior of the deterministic upper bound 
in (4) can be controlled via the tail behavior of offset Rademacher complexity. 

¹ We thank Shahar Mendelson for pointing out that the small ball condition in the initial version of this paper was too weak for our purposes.
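Theorem 4. Define the set $\mathcal{H} := \mathcal{F} - f^* + \mathrm{star}(\mathcal{F} - \mathcal{F})$. Assume the lower isometry bound in Definition 1 holds with $\eta = c/4$ and some $\delta < 1$, where $c$ is the constant in (3). Let $\xi_i = Y_i - f^*(X_i)$. Define

$$A := \sup_{h\in\mathcal{H}} \frac{\mathbb{E}h^4}{(\mathbb{E}h^2)^2} \quad\text{and}\quad B := \sup_{X,Y}\mathbb{E}\xi^4.$$

Then there exist two absolute constants $c', \tilde{c} > 0$ (depending only on $c$), such that

$$\mathbb{P}\left(\mathcal{E}(\hat{f}) > 4u\right) \leq 4\delta + 4\,\mathbb{P}\left(\sup_{h\in\mathcal{H}} \frac{1}{n}\sum_{i=1}^n \epsilon_i\xi_i h(X_i) - \tilde{c}\cdot h(X_i)^2 > u\right)$$

for any

$$u > \frac{32\sqrt{AB}}{c'}\cdot\frac{1}{n},$$

as long as $n > \frac{16(1-c')^2 A}{c'^2} \vee n_0(\mathcal{H},\delta,c/4)$.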

Theorem 4 states that excess loss is stochastically dominated by offset Rademacher complexity. We remark that the requirement that $A$ and $B$ be finite holds under mild moment conditions.

Remark 2. In certain cases, Definition 1 can be shown to hold for $f \in \mathcal{F} \setminus r^*\mathcal{B}$ (rather than all $f \in \mathcal{F}$), for some critical radius $r^*$, as soon as $n \geq n_0(\mathcal{F},\delta,\eta,r^*)$ (see [16]). In this case, the bound on the offset complexity is only affected additively by $(r^*)^2$.

We postpone the proof of the Theorem to the appendix. In a nutshell, it extends the classical probabilistic symmetrization technique [7, 13] to the non-zero-mean offset process under investigation.


5 Offset Rademacher Process: Chaining and Critical Radius 

Let us summarize the development so far. We have shown that excess loss of the Star estimator is upper bounded by the (data-dependent) offset Rademacher complexity, both in expectation and in high probability, under the appropriate assumptions. We claim that the necessary properties of the estimator are now captured by the offset complexity, and we are now squarely in the realm of empirical process theory. In particular, we may want to quantify rates of convergence under complexity assumptions on $\mathcal{F}$, such as covering numbers. In contrast to local Rademacher analyses where one would need to estimate the data-dependent fixed point of the critical radius in some way, the task is much easier for the offset complexity. To this end, we study the offset process with the tools of empirical process theory.


5.1 Chaining Bounds 

The first lemma describes the behavior of offset Rademacher process for a finite class. 
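Lemma 5. Let $V \subset \mathbb{R}^n$ be a finite set of vectors of cardinality $N$. Then for any $C > 0$,

$$\mathbb{E}_\epsilon \max_{v\in V}\left[\frac{1}{n}\sum_{i=1}^n \epsilon_i v_i - C v_i^2\right] \leq \frac{1}{2C}\cdot\frac{\log N}{n}.$$

Furthermore, for any $\delta > 0$,

$$\mathbb{P}_\epsilon\left(\max_{v\in V}\left[\frac{1}{n}\sum_{i=1}^n \epsilon_i v_i - C v_i^2\right] \geq \frac{1}{2C}\cdot\frac{\log N + \log 1/\delta}{n}\right) \leq \delta.$$

When the noise $\xi$ is unbounded,

$$\mathbb{E}_\epsilon \max_{v\in V}\left[\frac{1}{n}\sum_{i=1}^n \epsilon_i\xi_i v_i - C v_i^2\right] \leq M\cdot\frac{\log N}{n},$$

$$\mathbb{P}_\epsilon\left(\max_{v\in V}\left[\frac{1}{n}\sum_{i=1}^n \epsilon_i\xi_i v_i - C v_i^2\right] \geq M\cdot\frac{\log N + \log 1/\delta}{n}\right) \leq \delta,$$

where

$$M := \sup_{v\in V\setminus\{0\}} \frac{\sum_{i=1}^n v_i^2\xi_i^2}{2C\sum_{i=1}^n v_i^2}. \qquad (6)$$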
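The in-expectation bound of Lemma 5 for bounded noise can be checked exactly on a small instance by enumerating all $2^n$ sign patterns. The sketch below does this for a randomly generated finite set; the set `V`, its size, the constant `C`, and the function name are illustrative assumptions.

```python
import itertools
import math
import random

def expected_offset_max(V, C):
    """Exact E_eps max_{v in V} (1/n) sum_i [eps_i*v_i - C*v_i^2], by enumeration."""
    n = len(V[0])
    total = 0.0
    for eps in itertools.product((-1, 1), repeat=n):
        total += max(
            sum(e * v - C * v * v for e, v in zip(eps, vec)) / n for vec in V
        )
    return total / 2 ** n

rng = random.Random(1)
n, N, C = 10, 8, 0.5                      # illustrative sizes
V = [[rng.uniform(-2.0, 2.0) for _ in range(n)] for _ in range(N)]

lhs = expected_offset_max(V, C)           # exact expected offset maximum
rhs = math.log(N) / (2 * C * n)           # bound (1/(2C)) * log(N) / n
```

Note that the bound does not depend on the magnitudes of the vectors in `V`: large entries are penalized by the quadratic term, which is exactly the self-modulating effect discussed in the introduction.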


Armed with the lemma for a finite collection, we upper bound the offset Rademacher complexity of a general class through the chaining technique. We perform the analysis in expectation and in probability. Recall that a $\delta$-cover of a subset $S$ in a metric space $(T, d)$ is a collection of elements such that the union of the $\delta$-balls with centers at the elements contains $S$. A covering number at scale $\delta$ is the size of the minimal $\delta$-cover.

One of the main objectives of symmetrization is to arrive at a stochastic process that can be studied conditionally on data, so that all the relevant complexities can be made sample-based (or, empirical). Since the functions only enter offset Rademacher complexity through their values on the sample $X_1,\ldots,X_n$, we are left with a finite-dimensional object. Throughout the paper, we work with the empirical $\ell_2$ distance

$$d_n(f,g) = \left(\frac{1}{n}\sum_{i=1}^n (f(X_i) - g(X_i))^2\right)^{1/2}.$$

The covering number of $\mathcal{G}$ at scale $\delta$ with respect to $d_n$ will be denoted by $\mathcal{N}_2(\mathcal{G},\delta)$.
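For a finite collection of value vectors, a $\delta$-cover in the empirical $\ell_2$ distance can be built greedily. The sketch below is one simple construction and not necessarily a minimal cover: its size only upper-bounds the covering number when centers are restricted to the collection itself. The names `d_n`, `greedy_cover`, and the random data are assumptions for the example.

```python
import math
import random

def d_n(f, g):
    """Empirical l2 distance between two vectors of function values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f, g)) / len(f))

def greedy_cover(G, delta):
    """Keep a point as a new center only if no existing center is within delta."""
    centers = []
    for g in G:
        if all(d_n(g, c) > delta for c in centers):
            centers.append(g)
    return centers

rng = random.Random(0)
n = 15
G = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(100)]
delta = 0.6
centers = greedy_cover(G, delta)
covered = all(any(d_n(g, c) <= delta for c in centers) for g in G)
```

Every point either becomes a center or lies within `delta` of an earlier center, so the returned list is a valid cover of the collection at scale `delta`.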


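Lemma 6. Let $\mathcal{G}$ be a class of functions from $\mathcal{Z}$ to $\mathbb{R}$. Then for any $z_1,\ldots,z_n \in \mathcal{Z}$,

$$\mathbb{E}_\epsilon\sup_{g\in\mathcal{G}}\left[\frac{1}{n}\sum_{i=1}^n \epsilon_i g(z_i) - C g(z_i)^2\right] \leq \inf_{\gamma\geq 0,\,\alpha\in[0,\gamma]}\left\{\frac{(2/C)\log\mathcal{N}_2(\mathcal{G},\gamma)}{n} + 4\alpha + \frac{12}{\sqrt{n}}\int_\alpha^\gamma \sqrt{\log\mathcal{N}_2(\mathcal{G},\delta)}\,d\delta\right\}$$

where $\mathcal{N}_2(\mathcal{G},\gamma)$ is an $\ell_2$-cover of $\mathcal{G}$ on $(z_1,\ldots,z_n)$ at scale $\gamma$ (assumed to contain 0).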


Instead of assuming that 0 is contained in the cover, we may simply increase the size of the cover by 1, which 
can be absorbed by a small change of a constant. 

Let us discuss the upper bound of Lemma 6. First, we may take $\alpha = 0$, unless the integral diverges (which happens for very large classes with entropy growth of $\log\mathcal{N}_2(\mathcal{G},\delta) \sim \delta^{-p}$, $p \geq 2$). Next, observe that the first term is precisely the rate of aggregation with a finite collection of size $\mathcal{N}_2(\mathcal{G},\gamma)$. Hence, the upper bound is an optimal balance of the following procedure: cover the set at scale $\gamma$ and pay the rate of aggregation for this finite collection, plus pay the rate of convergence of ERM within a $\gamma$-ball. The optimal balance is given by some $\gamma$ (and can be easily computed under assumptions on covering number behavior; see [17]). The optimal $\gamma$ quantifies the localization radius that arises from the curvature of the loss function. One may also view the optimal balance as the well-known equation

$$\frac{\log\mathcal{N}_2(\mathcal{G},\gamma)}{n} \asymp \gamma^2,$$

studied in statistics [19] for well-specified models. The present paper, as well as [18], extend the analysis of this balance to the misspecified case and non-convex classes of functions.
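The balance between entropy and squared radius can be solved numerically under an explicit entropy model. The sketch below assumes, for illustration only, that $\log\mathcal{N}_2(\mathcal{G},\gamma) = \gamma^{-p}$ exactly, in which case the balancing radius is $\gamma = n^{-1/(2+p)}$ and the resulting rate is $\gamma^2 = n^{-2/(2+p)}$. The function name, bracketing interval, and parameter values are hypothetical.

```python
import math

def balance_radius(n, p, lo=1e-9, hi=1e3, iters=200):
    """Solve gamma^(-p)/n = gamma^2 by bisection (entropy model log N = gamma^(-p))."""
    def gap(g):
        return g ** (-p) / n - g ** 2     # strictly decreasing in g > 0
    for _ in range(iters):
        mid = math.sqrt(lo * hi)          # bisection on a logarithmic scale
        if gap(mid) > 0:
            lo = mid                      # entropy term still dominates: move right
        else:
            hi = mid
    return math.sqrt(lo * hi)

n, p = 10_000, 1.0                        # illustrative sample size and exponent
gamma = balance_radius(n, p)
rate = gamma ** 2                         # localized rate under the entropy model
```

For these values the numerical solution agrees with the closed form $n^{-1/(2+p)}$, and the squared radius reproduces the familiar nonparametric rate $n^{-2/(2+p)}$.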

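Now we provide a high probability analogue of Lemma 6.

Lemma 7. Let $\mathcal{G}$ be a class of functions from $\mathcal{Z}$ to $\mathbb{R}$. Then for any $z_1,\ldots,z_n \in \mathcal{Z}$ and any $u > 0$,

$$\mathbb{P}_\epsilon\left(\sup_{g\in\mathcal{G}}\left[\frac{1}{n}\sum_{i=1}^n \epsilon_i g(z_i) - C g(z_i)^2\right] > u\cdot\inf_{\alpha\in[0,\gamma]}\left\{4\alpha + \frac{12}{\sqrt{n}}\int_\alpha^\gamma\sqrt{\log\mathcal{N}_2(\mathcal{G},\delta)}\,d\delta\right\} + \frac{2}{C}\cdot\frac{\log\mathcal{N}_2(\mathcal{G},\gamma) + u}{n}\right) \leq \frac{2}{1 - e^{-2}}\exp(-cu^2) + \exp(-u)$$

where $\mathcal{N}_2(\mathcal{G},\gamma)$ is an $\ell_2$-cover of $\mathcal{G}$ on $(z_1,\ldots,z_n)$ at scale $\gamma$ (assumed to contain 0) and $C, c > 0$ are universal constants.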


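The above lemmas study the behavior of offset Rademacher complexity for abstract classes $\mathcal{G}$. Observe that the upper bounds in previous sections are in terms of the class $\mathcal{F} - f^* + \mathrm{star}(\mathcal{F} - \mathcal{F})$. This class, however, is not more complex than the original class $\mathcal{F}$ (with the exception of a finite class $\mathcal{F}$). More precisely, the covering numbers of $\mathcal{F} + \mathcal{F}' := \{f + g : f \in \mathcal{F}, g \in \mathcal{F}'\}$ and $\mathcal{F} - \mathcal{F}' := \{f - g : f \in \mathcal{F}, g \in \mathcal{F}'\}$ are bounded as

$$\log\mathcal{N}_2(\mathcal{F} + \mathcal{F}', 2\epsilon),\ \log\mathcal{N}_2(\mathcal{F} - \mathcal{F}', 2\epsilon) \leq \log\mathcal{N}_2(\mathcal{F},\epsilon) + \log\mathcal{N}_2(\mathcal{F}',\epsilon)$$

for any $\mathcal{F}, \mathcal{F}'$. The following lemma shows that the complexity of the star hull $\mathrm{star}(\mathcal{F})$ is also not significantly larger than that of $\mathcal{F}$.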

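Lemma 8 ([12], Lemma 4.5). For any scale $\epsilon > 0$, the covering number of $\mathcal{F} \subset \mathcal{B}_2$ and that of $\mathrm{star}(\mathcal{F})$ are bounded in the sense

$$\log\mathcal{N}_2(\mathcal{F}, 2\epsilon) \leq \log\mathcal{N}_2(\mathrm{star}(\mathcal{F}), 2\epsilon) \leq \log\frac{2}{\epsilon} + \log\mathcal{N}_2(\mathcal{F},\epsilon).$$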


5.2 Critical Radius 


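Now let us study the critical radius of offset Rademacher processes. Let $\xi = f^* - Y$ (the sign is immaterial here, since the $\epsilon_i$ are symmetric) and define

$$\alpha_n(\mathcal{H},\kappa,\delta) := \inf\left\{r > 0 : \mathbb{P}\left(\sup_{h\in\mathcal{H}\cap r\mathcal{B}}\left\{\frac{1}{n}\sum_{i=1}^n 2\epsilon_i\xi_i h(X_i) - c'\frac{1}{n}\sum_{i=1}^n h^2(X_i)\right\} \leq \kappa r^2\right) \geq 1 - \delta\right\}. \qquad (7)$$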

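Theorem 9. Assume $\mathcal{H}$ is star-shaped around 0 and the lower isometry bound holds for $\delta, \epsilon$. Define the critical radius

$$r = \alpha_n(\mathcal{H}, c'(1-\epsilon), \delta).$$

Then we have with probability at least $1 - 2\delta$,

$$\sup_{h\in\mathcal{H}}\left\{\frac{1}{n}\sum_{i=1}^n 2\epsilon_i\xi_i h(X_i) - c'\frac{1}{n}\sum_{i=1}^n h^2(X_i)\right\} = \sup_{h\in\mathcal{H}\cap r\mathcal{B}}\left\{\frac{1}{n}\sum_{i=1}^n 2\epsilon_i\xi_i h(X_i) - c'\frac{1}{n}\sum_{i=1}^n h^2(X_i)\right\},$$

which further implies

$$\sup_{h\in\mathcal{H}}\left\{\frac{1}{n}\sum_{i=1}^n 2\epsilon_i\xi_i h(X_i) - c'\frac{1}{n}\sum_{i=1}^n h^2(X_i)\right\} \leq r^2.$$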

The first statement of Theorem 9 shows the self-modulating behavior of the offset process: there is a critical 
radius, beyond which the fluctuations of the offset process are controlled by those within the radius. To understand 
the second statement, we observe that the complexity $\alpha_n$ is upper bounded by the corresponding complexity in [14], which is defined without the quadratic term subtracted off. Hence, offset Rademacher complexity is no larger (under the lower isometry bound of Definition 1, our Assumption 1) than the upper bounds obtained by [14] in terms of the critical radius.











6 Examples 

In this section, we briefly describe several applications. The first is concerned with parametric regression. 

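Lemma 10. Consider the parametric regression $Y_i = X_i^\top\beta^* + \xi_i$, $1 \leq i \leq n$, where $\xi_i$ need not be centered. The offset Rademacher complexity is bounded as

$$\mathbb{E}_\epsilon\sup_{\beta\in\mathbb{R}^p}\left\{\frac{1}{n}\sum_{i=1}^n 2\epsilon_i\xi_i X_i^\top\beta - C\beta^\top X_i X_i^\top\beta\right\} = \frac{\mathrm{tr}(G^{-1}H)}{Cn}$$

and

$$\mathbb{P}_\epsilon\left(\sup_{\beta\in\mathbb{R}^p}\left\{\frac{1}{n}\sum_{i=1}^n 2\epsilon_i\xi_i X_i^\top\beta - C\beta^\top X_i X_i^\top\beta\right\} - \frac{\mathrm{tr}(G^{-1}H)}{Cn} \geq \frac{\sqrt{\mathrm{tr}([G^{-1}H]^2)}}{Cn}\left(4\sqrt{2\log\tfrac{1}{\delta}} + 64\log\tfrac{1}{\delta}\right)\right) \leq \delta$$

where $G := \sum_{i=1}^n X_i X_i^\top$ is the Gram matrix and $H = \sum_{i=1}^n \xi_i^2 X_i X_i^\top$. In the well-specified case (that is, $\xi_i$ are zero-mean), assuming that the conditional variance is $\sigma^2$, then conditionally on the design matrix, $\mathbb{E}G^{-1}H = \sigma^2 I_p$ and excess loss is upper bounded by order $\sigma^2 p/n$.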
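Proof. The offset Rademacher complexity can be interpreted as a Fenchel-Legendre transform, where

$$\sup_{\beta\in\mathbb{R}^p}\left\{\frac{1}{n}\sum_{i=1}^n 2\epsilon_i\xi_i X_i^\top\beta - C\beta^\top X_i X_i^\top\beta\right\} = \frac{\sum_{i,j=1}^n \epsilon_i\epsilon_j\xi_i\xi_j X_i^\top G^{-1}X_j}{Cn}. \qquad (8)$$

Thus we have in expectation

$$\mathbb{E}_\epsilon\frac{1}{n}\sup_{\beta\in\mathbb{R}^p}\left\{\sum_{i=1}^n 2\epsilon_i\xi_i X_i^\top\beta - C\beta^\top X_i X_i^\top\beta\right\} = \frac{\sum_{i=1}^n \xi_i^2 X_i^\top G^{-1}X_i}{Cn} = \frac{\mathrm{tr}\left[G^{-1}\left(\sum_{i=1}^n \xi_i^2 X_i X_i^\top\right)\right]}{Cn}. \qquad (9)$$

For the high probability bound, note that the expression in Equation (8) is a Rademacher chaos of order two. Define the symmetric matrix $M \in \mathbb{R}^{n\times n}$ with entries

$$M_{ij} = \xi_i\xi_j X_i^\top G^{-1}X_j$$

and define

$$Z = \sum_{i,j=1}^n \epsilon_i\epsilon_j\xi_i\xi_j X_i^\top G^{-1}X_j = \sum_{i,j=1}^n \epsilon_i\epsilon_j M_{ij}.$$

Then

$$\mathbb{E}Z = \mathrm{tr}\left[G^{-1}\left(\sum_{i=1}^n \xi_i^2 X_i X_i^\top\right)\right],$$

and

$$\mathbb{E}\sum_{i=1}^n\left(\sum_{j=1}^n \epsilon_j M_{ij}\right)^2 = \|M\|_F^2 = \mathrm{tr}\left[G^{-1}\left(\sum_{i=1}^n \xi_i^2 X_i X_i^\top\right)G^{-1}\left(\sum_{j=1}^n \xi_j^2 X_j X_j^\top\right)\right].$$

Furthermore, the operator norm satisfies $\|M\| \leq \|M\|_F$. We apply the concentration result in [3, Exercise 6.9],

$$\mathbb{P}\left(Z - \mathbb{E}Z \geq 4\sqrt{2}\,\|M\|_F\sqrt{t} + 64\|M\|\,t\right) \leq e^{-t}. \qquad (10)$$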

□ 
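The identity behind (9), namely that averaging the quadratic form over the Rademacher signs leaves $\mathrm{tr}(G^{-1}H)/(Cn)$, can be verified exactly on a tiny instance. The sketch below assumes $p = 2$ so that the Gram matrix can be inverted by hand, and enumerates all sign patterns; the data, the constant `C`, and the helper names are illustrative and not taken from the paper.

```python
import itertools
import random

def inv2(A):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def quad(s, Ainv):
    """Quadratic form s^T Ainv s for a 2-vector s."""
    return sum(s[i] * Ainv[i][j] * s[j] for i in range(2) for j in range(2))

rng = random.Random(2)
n, C = 8, 1.0
X = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(n)]
xi = [rng.gauss(0.0, 1.0) for _ in range(n)]      # noise; need not be centered

# G = sum_i X_i X_i^T and H = sum_i xi_i^2 X_i X_i^T.
G = [[sum(x[i] * x[j] for x in X) for j in range(2)] for i in range(2)]
H = [[sum(z * z * x[i] * x[j] for x, z in zip(X, xi)) for j in range(2)]
     for i in range(2)]
Ginv = inv2(G)

def sup_value(eps):
    """Closed-form supremum over beta: s^T G^{-1} s / (C n), s = sum_i eps_i xi_i X_i."""
    s = [sum(e * z * x[i] for e, z, x in zip(eps, xi, X)) for i in range(2)]
    return quad(s, Ginv) / (C * n)

mean_sup = sum(sup_value(e) for e in itertools.product((-1, 1), repeat=n)) / 2 ** n
trace_GinvH = sum(Ginv[i][j] * H[j][i] for i in range(2) for j in range(2))
```

Cross terms $\epsilon_i\epsilon_j$ with $i \neq j$ average to zero, so only the diagonal of the chaos survives, which is why the expectation reduces to a trace.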
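For the finite dictionary aggregation problem, the following lemma shows control of offset Rademacher complexity.

Lemma 11. Assume $\mathcal{F}$ is a finite class of cardinality $N$. Define $\mathcal{H} = \mathcal{F} - f^* + \mathrm{star}(\mathcal{F} - \mathcal{F})$, which contains the Star estimator $\hat{f} - f^*$ defined in Equation (2). The offset Rademacher complexity for $\mathcal{H}$ is bounded as

$$\mathbb{E}_\epsilon\sup_{h\in\mathcal{H}}\left\{\frac{1}{n}\sum_{i=1}^n 2\epsilon_i\xi_i h(X_i) - C h(X_i)^2\right\} \leq \tilde{C}\cdot\frac{\log(N\vee n)}{n}$$

and

$$\mathbb{P}_\epsilon\left(\sup_{h\in\mathcal{H}}\left\{\frac{1}{n}\sum_{i=1}^n 2\epsilon_i\xi_i h(X_i) - C h(X_i)^2\right\} > \tilde{C}\cdot\frac{\log(N\vee n) + \log\frac{1}{\delta}}{n}\right) \leq \delta,$$

where $\tilde{C}$ is a constant that depends on $K := 2\left(\sqrt{\sum_{i=1}^n \xi_i^2/n} + 2C\right)$ and

$$M := \sup_{h\in\mathcal{H}\setminus\{0\}} \frac{\sum_{i=1}^n h(X_i)^2\xi_i^2}{2C\sum_{i=1}^n h(X_i)^2}.$$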

We observe that the bound of Lemma 11 is worse than the optimal bound of [1] by an additive $\frac{\log n}{n}$ term. This is due to the fact that the analysis for the finite case passes through the offset Rademacher complexity of the star hull, and for this case the star hull is richer than the finite class. For this case, a direct analysis of the Star estimator
is provided in [ ]. 

While the offset complexity of the star hull is crude for the finite case, the offset Rademacher complexity does 
capture the correct rates for regression with larger classes, initially derived in [18]. We briefly mention the result. 
The proof is identical to the one in [17], with the only difference that offset Rademacher is defined in that paper as
a sequential complexity in the context of online learning. 

Corollary 12. Consider the problem of nonparametric regression, as quantified by the growth

$$\log\mathcal{N}_2(\mathcal{F},\epsilon) \leq \epsilon^{-p}.$$

In the regime $p \in (0,2)$, the upper bound of Lemma 7 scales as $n^{-\frac{2}{2+p}}$. In the regime $p \geq 2$, the bound scales as $n^{-1/p}$, with an extra logarithmic factor at $p = 2$.

For the parametric case of p = 0, one may also readily estimate the offset complexity. Results for VC classes, 
sparse combinations of dictionary elements, and other parametric cases follow easily by plugging in the estimate 
for the covering number or directly upper bounding the offset complexity (see [18, 17]). 


7 Lower bound on Minimax Regret via Offset Rademacher Complexity 


We conclude this paper with a lower bound on minimax regret in terms of offset Rademacher complexity. 
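Theorem 13 (Minimax Lower Bound on Regret). Define the offset Rademacher complexity over $\mathcal{X}^{\otimes n}$ as

$$\mathcal{R}^o(n,\mathcal{F}) = \sup_{\{x_i\}_{i=1}^n\in\mathcal{X}^{\otimes n}} \mathbb{E}_\epsilon\sup_{f\in\mathcal{F}}\left\{\frac{1}{n}\sum_{i=1}^n 2\epsilon_i f(x_i) - f(x_i)^2\right\},$$

then the following minimax lower bound on regret holds:

$$\inf_{\hat{g}\in\mathcal{G}}\sup_{P}\left\{\mathbb{E}(\hat{g} - Y)^2 - \inf_{f\in\mathcal{F}}\mathbb{E}(f - Y)^2\right\} \geq \mathcal{R}^o((1+c)n,\mathcal{F}) - \frac{c}{1+c}\mathcal{R}^o(cn,\mathcal{F}),$$

for any $c > 0$.

For the purposes of matching the performance of the Star procedure, we can take $\mathcal{G} = \mathcal{F} + \mathrm{star}(\mathcal{F} - \mathcal{F})$.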
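A Proofs


Proof of Theorem 3. Since $\hat{f}$ is in the star hull around $\hat{g}$, $\hat{f}$ must lie in the set $\mathcal{H} := \mathcal{F} + \mathrm{star}(\mathcal{F} - \mathcal{F})$. Hence, in view of (4), excess loss $\mathcal{E}(\hat{f})$ is upper bounded by

$$\sup_{f\in\mathcal{H}}\left\{(\hat{\mathbb{E}} - \mathbb{E})[2(f^* - Y)(f^* - f)] + \mathbb{E}(f^* - f)^2 - (1 + c)\cdot\hat{\mathbb{E}}(f^* - f)^2\right\} \qquad (11)$$
$$\leq \sup_{f\in\mathcal{H}}\left\{(\hat{\mathbb{E}} - \mathbb{E})[2(f^* - Y)(f^* - f)] + (1 + c/4)\mathbb{E}(f^* - f)^2 - (1 + 3c/4)\cdot\hat{\mathbb{E}}(f^* - f)^2 - (c/4)\left(\hat{\mathbb{E}}(f^* - f)^2 + \mathbb{E}(f^* - f)^2\right)\right\}$$
$$\leq \sup_{f\in\mathcal{H}}\left\{(\hat{\mathbb{E}} - \mathbb{E})[2(f^* - Y)(f^* - f)] - (c/4)\left(\hat{\mathbb{E}}(f^* - f)^2 + \mathbb{E}(f^* - f)^2\right)\right\} \qquad (12)$$
$$+ \sup_{f\in\mathcal{H}}\left\{(1 + c/4)\mathbb{E}(f^* - f)^2 - (1 + 3c/4)\cdot\hat{\mathbb{E}}(f^* - f)^2\right\}. \qquad (13)$$

We invoke the supporting Lemma 14 (stated and proved below) for the term (13):

$$\mathbb{E}\sup_{f\in\mathcal{H}}\left\{(1 + c/4)\mathbb{E}(f^* - f)^2 - (1 + 3c/4)\cdot\hat{\mathbb{E}}(f^* - f)^2\right\} \qquad (14)$$
$$\leq \frac{K(2+c)}{2}\cdot\mathbb{E}\sup_{f\in\mathcal{H}}\left\{\frac{1}{n}\sum_{i=1}^n 2\epsilon_i(f(X_i) - f^*(X_i)) - \frac{c}{4K(2+c)}\cdot\frac{1}{n}\sum_{i=1}^n (f(X_i) - f^*(X_i))^2\right\}. \qquad (15)$$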

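Let $\hat{\mathbb{E}}'$ stand for empirical expectation with respect to an independent copy $(X_1', Y_1'),\ldots,(X_n', Y_n')$. For the term (12), Jensen's inequality yields

$$\mathbb{E}\sup_{f\in\mathcal{H}}\left\{(\hat{\mathbb{E}} - \mathbb{E})[2(f^* - Y)(f^* - f)] - (c/4)\left(\hat{\mathbb{E}}(f^* - f)^2 + \mathbb{E}(f^* - f)^2\right)\right\}$$
$$\leq \mathbb{E}\sup_{f\in\mathcal{H}}\left\{(\hat{\mathbb{E}} - \hat{\mathbb{E}}')[2(f^* - Y)(f^* - f)] - (c/4)\left(\hat{\mathbb{E}}(f^* - f)^2 + \hat{\mathbb{E}}'(f^* - f)^2\right)\right\}.$$

When introducing i.i.d. Rademacher random variables, we observe that the quadratic term remains unchanged by renaming $X_i$ and $X_i'$, and thus the preceding expression is upper bounded by

$$2\,\mathbb{E}\sup_{f\in\mathcal{H}}\left\{\frac{1}{n}\sum_{i=1}^n 2\epsilon_i(f^*(X_i) - Y_i)(f^*(X_i) - f(X_i)) - (c/4)(f^*(X_i) - f(X_i))^2\right\}.$$

Using a contraction technique as in the proof of Lemma 14, we obtain an upper bound of

$$2M\cdot\mathbb{E}\sup_{f\in\mathcal{H}}\left\{\frac{1}{n}\sum_{i=1}^n 2\epsilon_i(f^*(X_i) - f(X_i)) - \frac{c}{4M}(f^*(X_i) - f(X_i))^2\right\}. \qquad (16)$$

Combining the bounds yields the statement of the theorem.

□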


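Lemma 14. For any class $\mathcal{F}$ of uniformly bounded functions with $K = \sup_{f\in\mathcal{F}}|f|_\infty$, for any $f^* \in \mathcal{F}$, and for any $c > 0$, it holds that

$$\mathbb{E}\sup_{f\in\mathcal{F}}\left\{\mathbb{E}(f - f^*)^2 - (1 + 2c)\hat{\mathbb{E}}(f - f^*)^2\right\} \leq 4K(1+c)\cdot\mathbb{E}\sup_{f\in\mathcal{F}}\left\{\frac{1}{n}\sum_{i=1}^n \epsilon_i(f(X_i) - f^*(X_i)) - \frac{c}{4K(1+c)}\cdot\frac{1}{n}\sum_{i=1}^n (f(X_i) - f^*(X_i))^2\right\}.$$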

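Proof of Lemma 14. We write

$$\mathbb{E}\sup_{f\in\mathcal{F}}\left\{\mathbb{E}(f - f^*)^2 - (1 + 2c)\hat{\mathbb{E}}(f - f^*)^2\right\}$$
$$= \mathbb{E}\sup_{f\in\mathcal{F}}\left\{(1 + c)\mathbb{E}(f - f^*)^2 - (1 + c)\hat{\mathbb{E}}(f - f^*)^2 - c\,\mathbb{E}(f - f^*)^2 - c\,\hat{\mathbb{E}}(f - f^*)^2\right\}$$

which, by Jensen's inequality, is upper bounded by

$$\mathbb{E}\sup_{f\in\mathcal{F}}\left\{(1 + c)\left(\hat{\mathbb{E}}'(f - f^*)^2 - \hat{\mathbb{E}}(f - f^*)^2\right) - c\,\hat{\mathbb{E}}'(f - f^*)^2 - c\,\hat{\mathbb{E}}(f - f^*)^2\right\}.$$

We recall that $\hat{\mathbb{E}}'$ is an empirical mean operator with respect to an independent copy $(X_1',\ldots,X_n')$. Writing out the empirical expectations in the above expression and introducing Rademacher signs, the above is equal to

$$\mathbb{E}\sup_{f\in\mathcal{F}}\left\{\frac{1+c}{n}\sum_{i=1}^n \epsilon_i\left((f(X_i') - f^*(X_i'))^2 - (f(X_i) - f^*(X_i))^2\right) - c\,\hat{\mathbb{E}}'(f - f^*)^2 - c\,\hat{\mathbb{E}}(f - f^*)^2\right\}$$
$$\leq 2\cdot\mathbb{E}\sup_{f\in\mathcal{F}}\left\{\frac{1+c}{n}\sum_{i=1}^n \epsilon_i(f(X_i) - f^*(X_i))^2 - c\,\hat{\mathbb{E}}(f - f^*)^2\right\}$$

with the last expectation taken over $\epsilon_i$ and data $X_i$, $1 \leq i \leq n$.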

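We proceed with a contraction-style proof. Condition on $X_1,\ldots,X_n$ and $\epsilon_2,\ldots,\epsilon_n$, and write out the expectation with respect to $\epsilon_1$:

$$\frac{1}{2}\sup_{f\in\mathcal{F}}\left\{\frac{1+c}{n}\sum_{i=2}^n \epsilon_i(f(X_i) - f^*(X_i))^2 - c\,\hat{\mathbb{E}}(f - f^*)^2 + \frac{1+c}{n}(f(X_1) - f^*(X_1))^2\right\}$$
$$+ \frac{1}{2}\sup_{g\in\mathcal{F}}\left\{\frac{1+c}{n}\sum_{i=2}^n \epsilon_i(g(X_i) - f^*(X_i))^2 - c\,\hat{\mathbb{E}}(g - f^*)^2 - \frac{1+c}{n}(g(X_1) - f^*(X_1))^2\right\}$$
$$\leq \frac{1}{2}\sup_{f,g\in\mathcal{F}}\left\{\frac{1+c}{n}\sum_{i=2}^n \epsilon_i(f(X_i) - f^*(X_i))^2 - c\,\hat{\mathbb{E}}(f - f^*)^2 + \frac{1+c}{n}\sum_{i=2}^n \epsilon_i(g(X_i) - f^*(X_i))^2 - c\,\hat{\mathbb{E}}(g - f^*)^2 + \frac{4K(1+c)}{n}|f(X_1) - g(X_1)|\right\}.$$

The absolute value can be dropped since the expression is symmetric in $f, g$. We obtain an upper bound of

$$\frac{1}{2}\sup_{f,g\in\mathcal{F}}\left\{\frac{1+c}{n}\sum_{i=2}^n \epsilon_i(f(X_i) - f^*(X_i))^2 - c\,\hat{\mathbb{E}}(f - f^*)^2 + \frac{1+c}{n}\sum_{i=2}^n \epsilon_i(g(X_i) - f^*(X_i))^2 - c\,\hat{\mathbb{E}}(g - f^*)^2 + \frac{4K(1+c)}{n}(f(X_1) - g(X_1))\right\}$$
$$= \mathbb{E}_{\epsilon_1}\sup_{f\in\mathcal{F}}\left\{\frac{1+c}{n}\sum_{i=2}^n \epsilon_i(f(X_i) - f^*(X_i))^2 - c\,\hat{\mathbb{E}}(f - f^*)^2 + \frac{4K(1+c)}{n}\epsilon_1 f(X_1)\right\}.$$

Proceeding in this fashion for $\epsilon_2$ until $\epsilon_n$, we conclude

$$\mathbb{E}\sup_{f\in\mathcal{F}}\left\{\mathbb{E}(f - f^*)^2 - (1 + 2c)\hat{\mathbb{E}}(f - f^*)^2\right\} \leq \mathbb{E}\sup_{f\in\mathcal{F}}\left\{\frac{4K(1+c)}{n}\sum_{i=1}^n \epsilon_i(f(X_i) - f^*(X_i)) - \frac{c}{n}\sum_{i=1}^n (f(X_i) - f^*(X_i))^2\right\}$$

where we added $f^*$ back in for free since random signs are zero-mean. □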

Proof of Theorem 4. We start with the deterministic upper bound (11) on excess loss (see the proof of Theorem 3):

$$\sup_{h\in\mathcal{H}}\left\{(\hat{\mathbb{E}} - \mathbb{E})[2\xi h] + \mathbb{E}h^2 - (1 + c)\cdot\hat{\mathbb{E}}h^2\right\} \qquad (17)$$

where $h = f - f^* \in \mathcal{H}$. Define

$$U_{X_i,Y_i}(h) = 2\xi_i h(X_i) - \mathbb{E}[2\xi h] + \mathbb{E}h^2 - (1 + c)\cdot h(X_i)^2,$$
$$V_{X_i,Y_i}(h) = 2\xi_i h(X_i) - \mathbb{E}[2\xi h] - \mathbb{E}h^2 + (1 - c')\cdot h(X_i)^2,$$

where $c'$ will be specified later. We now prove a version of the probabilistic symmetrization lemma [7, 13] for

$$\mathbb{P}\left(\sup_{h\in\mathcal{H}}\sum_{i=1}^n U_{X_i,Y_i}(h) > x\right). \qquad (18)$$

Note that unlike the usual applications of the technique in the literature, we perform symmetrization with the quadratic terms. Define

$$\mathcal{S} = \left\{\sup_{h\in\mathcal{H}}\sum_{i=1}^n U_{X_i,Y_i}(h) > x\right\}, \qquad \beta = \inf_{h\in\mathcal{H}}\mathbb{P}\left(\sum_{i=1}^n V_{X_i,Y_i}(h) < \frac{x}{2}\right). \qquad (19)$$


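Clearly for $\{X_i, Y_i\}_{i=1}^n \in \mathcal{S}$, there exists an $h \in \mathcal{H}$ that satisfies the condition in $\mathcal{S}$. If in addition $h$ satisfies

$$\sum_{i=1}^n V_{X_i',Y_i'}(h) < \frac{x}{2},$$

then

$$\sum_{i=1}^n U_{X_i,Y_i}(h) - V_{X_i',Y_i'}(h) > \frac{x}{2}$$

and therefore

$$\sup_{h\in\mathcal{H}}\sum_{i=1}^n U_{X_i,Y_i}(h) - V_{X_i',Y_i'}(h) > \frac{x}{2}.$$

The latter can be written as

$$\sup_{h\in\mathcal{H}}\left\{\sum_{i=1}^n 2\xi_i h(X_i) - 2\xi_i' h(X_i') + 2\mathbb{E}h^2 - (1 + c)\cdot h(X_i)^2 - (1 - c')\cdot h(X_i')^2\right\} > \frac{x}{2}.$$

Then for this particular $h$,

$$\beta = \inf_{g\in\mathcal{H}}\mathbb{P}\left(\sum_{i=1}^n V_{X_i',Y_i'}(g) < \frac{x}{2}\right) \leq \mathbb{P}'\left(\sum_{i=1}^n V_{X_i',Y_i'}(h) < \frac{x}{2}\right)$$
$$\leq \mathbb{P}'\left(\sum_{i=1}^n U_{X_i,Y_i}(h) - V_{X_i',Y_i'}(h) > \frac{x}{2}\right) \leq \mathbb{P}'\left(\sup_{h\in\mathcal{H}}\sum_{i=1}^n U_{X_i,Y_i}(h) - V_{X_i',Y_i'}(h) > \frac{x}{2}\right).$$

Note that the right-hand-side does not depend on $h$. We integrate over $\{X_i, Y_i\}_{i=1}^n \in \mathcal{S}$ to obtain

$$\beta\cdot\mathbb{P}\left(\sup_{h\in\mathcal{H}}\sum_{i=1}^n U_{X_i,Y_i}(h) > x\right) \leq \mathbb{P}\left(\sup_{h\in\mathcal{H}} n\cdot\left\{2(\hat{\mathbb{E}} - \hat{\mathbb{E}}')[\xi h] + 2\mathbb{E}h^2 - (1 + c)\cdot\hat{\mathbb{E}}h^2 - (1 - c')\cdot\hat{\mathbb{E}}'h^2\right\} > \frac{x}{2}\right). \qquad (20)$$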


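Next, we apply Assumption 1 with $\epsilon = c/4 = 1/72$ to terms in (20) to construct an offset Rademacher process. Note

$$\frac{2}{1-\epsilon} < 2(1 + 2\epsilon) = 2 + c.$$

We can now choose $\tilde{c}, c' > 0$ that satisfy

$$\frac{2}{1-\epsilon} < 2 + c - c' - 2\tilde{c},$$

namely

$$1 - (1 - c' - \tilde{c})(1 - \epsilon) < (1 + c - \tilde{c})(1 - \epsilon) - 1. \qquad (21)$$

Choose $b$ now such that

$$1 - (1 - c' - \tilde{c})(1 - \epsilon) \leq b \leq (1 + c - \tilde{c})(1 - \epsilon) - 1. \qquad (22)$$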
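Then we have on the set $\mathcal{H}$, applying the lower isometry bound and Eq. (22), with probability at least $1 - 2\delta$,

$$
\begin{aligned}
(1+b)\,\mathbb{E}h^2 - (1+c) \cdot \hat{\mathbb{E}}h^2 &\le -\tilde{c} \cdot \hat{\mathbb{E}}h^2, \\
(1-b)\,\mathbb{E}h^2 - (1-c') \cdot \hat{\mathbb{E}}'h^2 &\le -\tilde{c} \cdot \hat{\mathbb{E}}'h^2.
\end{aligned}
$$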
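The two inequalities obtained from Eq. (22) can be unpacked as follows. This is only a sketch: it assumes the lower isometry bound is used in the form $\hat{\mathbb{E}}h^2 \ge (1-\epsilon)\,\mathbb{E}h^2$ (and likewise for the ghost sample), and that $0 < b < 1$, which holds for the small constants involved.

```latex
% Sketch, assuming the lower isometry bound reads \hat{E} h^2 >= (1 - \epsilon) E h^2
% for both the original and the ghost sample, and 0 < b < 1.
\begin{align*}
(1+b)\,\mathbb{E}h^2
  &\le \frac{1+b}{1-\epsilon}\,\hat{\mathbb{E}}h^2
   \le (1+c-\tilde{c})\,\hat{\mathbb{E}}h^2
  && \text{since } 1+b < (1+c-\tilde{c})(1-\epsilon), \\
(1-b)\,\mathbb{E}h^2
  &\le \frac{1-b}{1-\epsilon}\,\hat{\mathbb{E}}'h^2
   \le (1-c'-\tilde{c})\,\hat{\mathbb{E}}'h^2
  && \text{since } 1-b < (1-c'-\tilde{c})(1-\epsilon).
\end{align*}
% Subtracting (1+c) \hat{E} h^2 from the first line and (1-c') \hat{E}' h^2 from the
% second leaves -\tilde{c} \hat{E} h^2 and -\tilde{c} \hat{E}' h^2, respectively.
```

The upper bound on $b$ in (22) drives the first line and the lower bound drives the second, which is why $b$ must lie strictly between the two quantities.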

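Thus we can continue bounding the expression in (20) as

$$
\begin{aligned}
&\sup_{h \in \mathcal{H}} n \cdot \left\{ 2(\hat{\mathbb{E}} - \hat{\mathbb{E}}')[\xi h] + 2\mathbb{E}h^2 - (1+c) \cdot \hat{\mathbb{E}}h^2 - (1-c') \cdot \hat{\mathbb{E}}'h^2 \right\} \\
&\quad = \sup_{h \in \mathcal{H}} n \cdot \left\{ 2(\hat{\mathbb{E}} - \hat{\mathbb{E}}')[\xi h] + (1+b)\mathbb{E}h^2 - (1+c) \cdot \hat{\mathbb{E}}h^2 + (1-b)\mathbb{E}h^2 - (1-c') \cdot \hat{\mathbb{E}}'h^2 \right\} \\
&\quad \le \sup_{h \in \mathcal{H}} n \cdot \left\{ 2(\hat{\mathbb{E}} - \hat{\mathbb{E}}')[\xi h] - \tilde{c} \cdot \hat{\mathbb{E}}h^2 - \tilde{c} \cdot \hat{\mathbb{E}}'h^2 \right\}.
\end{aligned}
$$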


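For the probability of deviation, we obtain

\begin{align}
\beta \cdot P\left( \sup_{h \in \mathcal{H}} \sum_{i=1}^n U_{X_i,Y_i}(h) > x \right)
&\le P\left( \sup_{h \in \mathcal{H}} n \cdot \left\{ 2(\hat{\mathbb{E}} - \hat{\mathbb{E}}')[\xi h] - \tilde{c} \cdot \hat{\mathbb{E}}h^2 - \tilde{c} \cdot \hat{\mathbb{E}}'h^2 \right\} > \frac{x}{2} \right) + 2\delta \tag{23} \\
&= P\left( \sup_{h \in \mathcal{H}} n \cdot \left\{ 2(\hat{\mathbb{E}} - \hat{\mathbb{E}}')[\epsilon \xi h] - \tilde{c} \cdot \hat{\mathbb{E}}h^2 - \tilde{c} \cdot \hat{\mathbb{E}}'h^2 \right\} > \frac{x}{2} \right) + 2\delta \tag{24} \\
&\le 2 P\left( \sup_{h \in \mathcal{H}} \left\{ \sum_{i=1}^n 2\epsilon_i \xi_i h(X_i) - \tilde{c} \cdot \sum_{i=1}^n h(X_i)^2 \right\} > \frac{x}{4} \right) + 2\delta. \tag{25}
\end{align}

To estimate $\beta$, write

$$
\beta = \inf_{h \in \mathcal{H}} P\left( \sum_{i=1}^n V_{X_i,Y_i}(h) < \frac{x}{2} \right)
= 1 - \sup_{h \in \mathcal{H}} P\left( \sum_{i=1}^n 2\xi_i h(X_i) - \mathbb{E}[2\xi h] - \mathbb{E}h^2 + (1-c') \cdot h(X_i)^2 \ge \frac{x}{2} \right).
$$

Let us bound the last term in the above equation. For any $h \in \mathcal{H}$,

\begin{align}
& P\left( (\hat{\mathbb{E}} - \mathbb{E})[2\xi h] + (1-c')\hat{\mathbb{E}}h^2 - \mathbb{E}h^2 \ge \frac{x}{2n} \right) \notag \\
&\quad \le P\left( (\hat{\mathbb{E}} - \mathbb{E})[2\xi h] \ge \frac{x}{2n} + \frac{c'}{2}\mathbb{E}h^2 \right)
+ P\left( (\hat{\mathbb{E}} - \mathbb{E})[h^2] \ge \frac{c'}{2(1-c')}\mathbb{E}h^2 \right). \tag{26}
\end{align}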


Define

$$
A := \sup_{h \in \mathcal{H}} \frac{\mathbb{E}h^4}{(\mathbb{E}h^2)^2} \qquad \text{and} \qquad B := \sup_{X,Y} \mathbb{E}\xi^4.
$$

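Then for the second term in Eq. (26), using Chebyshev's inequality,

$$
P\left( (\hat{\mathbb{E}} - \mathbb{E})[h^2] \ge \frac{c'}{2(1-c')}\mathbb{E}h^2 \right)
\le \frac{4(1-c')^2 A}{n c'^2} \le \frac{1}{4}
\qquad \text{if} \qquad n \ge \frac{16(1-c')^2 A}{c'^2}.
$$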
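For the first term in Eq. (26), note

$$
\mathrm{Var}[2\xi h] \le 4\mathbb{E}[\xi^2 h^2] \le 4\sqrt{AB} \cdot \mathbb{E}h^2,
$$

and thus through Chebyshev's inequality

$$
P\left( (\hat{\mathbb{E}} - \mathbb{E})[2\xi h] \ge \frac{x}{2n} + \frac{c'}{2}\mathbb{E}h^2 \right)
\le \frac{4\sqrt{AB} \cdot \mathbb{E}h^2}{n \left( \frac{x}{2n} + \frac{c'}{2}\mathbb{E}h^2 \right)^2}
\le \frac{4\sqrt{AB} \cdot \mathbb{E}h^2}{n \cdot 4 \cdot \frac{x}{2n} \cdot \frac{c'}{2}\mathbb{E}h^2}
\le \frac{1}{4}
\qquad \text{if} \qquad x \ge \frac{16\sqrt{AB}}{c'}.
$$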
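The Chebyshev step for the first term rests on the elementary inequality $(a+b)^2 \ge 4ab$. The following worked algebra is a sketch of that step; the shorthand $a$ and $b$ is introduced only for this note and does not appear in the original argument.

```latex
% Shorthand (an assumption of this note): a = x/(2n), b = (c'/2) E h^2, both nonnegative.
\begin{align*}
n\,(a+b)^2 &\ge 4n\,ab
  = 4n \cdot \frac{x}{2n} \cdot \frac{c'}{2}\,\mathbb{E}h^2
  = c'\,x\,\mathbb{E}h^2, \\
\frac{4\sqrt{AB}\,\mathbb{E}h^2}{n\,(a+b)^2}
  &\le \frac{4\sqrt{AB}\,\mathbb{E}h^2}{c'\,x\,\mathbb{E}h^2}
  = \frac{4\sqrt{AB}}{c'\,x}.
\end{align*}
% The last quantity is at most 1/4 exactly when x >= 16 \sqrt{AB} / c'.
```

The factor $n$ in the denominator comes from the variance of an empirical average of $n$ independent terms, and $\mathbb{E}h^2$ cancels, so the resulting threshold on $x$ is uniform over $h \in \mathcal{H}$.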


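Assembling the above bounds, for any $h \in \mathcal{H}$,

$$
\sup_{h \in \mathcal{H}} P\left( \sum_{i=1}^n 2\xi_i h(X_i) - \mathbb{E}[2\xi h] - \mathbb{E}h^2 + (1-c') \cdot h(X_i)^2 \ge \frac{x}{2} \right) \le \frac{1}{2},
$$

which further implies $\beta \ge 1/2$ for any $x > 16\sqrt{AB}/c'$ and whenever

$$
n > \frac{16(1-c')^2 A}{c'^2}.
$$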


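Under the above regime,

$$
\frac{1}{2} P\left( \sup_{h \in \mathcal{H}} \sum_{i=1}^n U_{X_i,Y_i}(h) > x \right)
\le 2 P\left( \sup_{h \in \mathcal{H}} \left\{ \sum_{i=1}^n 2\epsilon_i \xi_i h(X_i) - \tilde{c} \cdot \sum_{i=1}^n h(X_i)^2 \right\} > \frac{x}{4} \right) + 2\delta,
$$

and so, taking $x = 4nt$,

$$
P\left( \sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n U_{X_i,Y_i}(h) > 4t \right)
\le 4 P\left( \sup_{h \in \mathcal{H}} \left\{ \frac{1}{n} \sum_{i=1}^n 2\epsilon_i \xi_i h(X_i) - \tilde{c} \cdot \frac{1}{n} \sum_{i=1}^n h(X_i)^2 \right\} > t \right) + 4\delta.
$$

We conclude by writing

$$
P\left( \sup_{h \in \mathcal{H}} (\hat{\mathbb{E}} - \mathbb{E})[2\xi h] + \mathbb{E}h^2 - (1+c) \cdot \hat{\mathbb{E}}h^2 > 4t \right)
\le 4 P\left( \sup_{h \in \mathcal{H}} \left\{ \frac{1}{n} \sum_{i=1}^n 2\epsilon_i \xi_i h(X_i) - \tilde{c} \cdot \frac{1}{n} \sum_{i=1}^n h(X_i)^2 \right\} > t \right) + 4\delta.
$$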


□ 


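Proof of Lemma 5. Using a standard argument,

$$
\mathbb{E}_\epsilon \max_{v \in V} \left[ \sum_{i=1}^n \epsilon_i v_i - C v_i^2 \right]
\le \frac{1}{\lambda} \log \sum_{v \in V} \mathbb{E}_\epsilon \exp\left[ \sum_{i=1}^n \lambda \epsilon_i v_i - \lambda C v_i^2 \right].
$$

For any $v \in V$,

$$
\mathbb{E}_\epsilon \exp\left[ \sum_{i=1}^n \lambda \epsilon_i v_i - \lambda C v_i^2 \right]
\le \exp\left[ \sum_{i=1}^n \frac{\lambda^2 v_i^2}{2} - \lambda C v_i^2 \right] \le 1
$$

by setting $\lambda = 2C$. The first claim follows. For the second claim,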


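$$
\begin{aligned}
P_\epsilon\left( \max_{v \in V} \left[ \sum_{i=1}^n \epsilon_i v_i - C v_i^2 \right] \ge \frac{1}{2C} \log(N/\delta) \right)
&\le \mathbb{E}_\epsilon \exp\left\{ \lambda \max_{v \in V} \left[ \sum_{i=1}^n \epsilon_i v_i - C v_i^2 \right] - \lambda \frac{1}{2C} \log(N/\delta) \right\} \\
&\le \sum_{v \in V} \mathbb{E}_\epsilon \exp\left\{ \lambda \left[ \sum_{i=1}^n \epsilon_i v_i - C v_i^2 \right] - \lambda \frac{1}{2C} \log(N/\delta) \right\} \\
&\le \sum_{v \in V} \exp\{ -\log(N/\delta) \} = \delta.
\end{aligned}
$$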


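Now let us move to the case where the noise is unbounded:

$$
\begin{aligned}
\mathbb{E}_\epsilon \max_{v \in V} \left[ \sum_{i=1}^n \epsilon_i \xi_i v_i - C v_i^2 \right]
&\le \frac{1}{\lambda} \log \mathbb{E}_\epsilon \sum_{v \in V} \exp\left( \sum_{i=1}^n \lambda \epsilon_i \xi_i v_i - \lambda C v_i^2 \right) \\
&\le \frac{1}{\lambda} \log \sum_{v \in V} \exp\left( \sum_{i=1}^n \frac{\lambda^2 \xi_i^2 v_i^2}{2} - \lambda C v_i^2 \right) \\
&\le \log N \cdot \max_{v \in V \setminus \{0\}} \frac{\sum_{i=1}^n \xi_i^2 v_i^2}{2C \sum_{i=1}^n v_i^2}
\end{aligned}
$$

if we take $\lambda = \min_{v \in V \setminus \{0\}} \frac{2C \sum_{i=1}^n v_i^2}{\sum_{i=1}^n \xi_i^2 v_i^2}$. The high probability statement also follows using this particular choice of $\lambda$.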


□ 
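As a sanity check on the first claim of Lemma 5, the expectation over the signs can be computed exactly for a small finite set by enumerating all $2^n$ sign vectors. The sketch below is illustrative only: the function names, the random set `V`, and the choices `n = 8`, `N = 16`, `C = 0.5` are assumptions made for this example, and the bound is checked in the unnormalized form $\log N / (2C)$ that appears in the proof.

```python
import itertools
import math
import random


def exact_offset_expectation(vectors, C):
    """Exact E_eps max_v sum_i (eps_i * v_i - C * v_i**2), enumerating all sign patterns."""
    n = len(vectors[0])
    total = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=n):
        total += max(
            sum(s * x - C * x * x for s, x in zip(signs, v)) for v in vectors
        )
    return total / 2 ** n


def finite_class_bound(num_vectors, C):
    """Right-hand side of the first claim, unnormalized: log(N) / (2C)."""
    return math.log(num_vectors) / (2.0 * C)


# Hypothetical example data: N random vectors in [-1, 1]^n.
rng = random.Random(0)
n, N, C = 8, 16, 0.5
V = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(N)]

lhs = exact_offset_expectation(V, C)
rhs = finite_class_bound(N, C)
print(f"E max = {lhs:.4f}, log(N)/(2C) = {rhs:.4f}")
```

Exhaustive enumeration is used instead of Monte Carlo so that the comparison involves no sampling error; it is only feasible for small `n`.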


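Proof of Lemma 6. The proof proceeds as in [17]. Fix $\gamma \in [0,1]$. By definition of a cover, there exists a set $V \subset \mathbb{R}^n$ of vectors of size $N = \mathcal{N}_2(\mathcal{G}, \gamma)$ with the following property: for any $g \in \mathcal{G}$ there exists a $v = v[g] \in V$ such that

$$
\frac{1}{n} \sum_{t=1}^n \big(g(z_t) - v_t\big)^2 \le \gamma^2.
$$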


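Then we may write

\begin{align}
\mathbb{E}_\epsilon \sup_{g \in \mathcal{G}} \left[ \frac{1}{n} \sum_{t=1}^n \epsilon_t g(z_t) - C g(z_t)^2 \right]
&\le \mathbb{E}_\epsilon \sup_{g \in \mathcal{G}} \left[ \frac{1}{n} \sum_{t=1}^n \epsilon_t \big(g(z_t) - v[g]_t\big) \right] \tag{28} \\
&\quad + \mathbb{E}_\epsilon \sup_{g \in \mathcal{G}} \left[ \frac{1}{n} \sum_{t=1}^n (C/4)\, v[g]_t^2 - C g(z_t)^2 \right] \notag \\
&\quad + \mathbb{E}_\epsilon \sup_{g \in \mathcal{G}} \left[ \frac{1}{n} \sum_{t=1}^n \epsilon_t v[g]_t - (C/4)\, v[g]_t^2 \right]. \tag{29}
\end{align}

We now argue that the second term is nonpositive. More precisely, we claim that for any $g \in \mathcal{G}$,

$$
\frac{1}{4} \sum_{t=1}^n v[g]_t^2 \le \sum_{t=1}^n g(z_t)^2 \tag{30}
$$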


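for some element $v[g] \in V \cup \{0\}$. First, consider the case $\frac{1}{n}\sum_{t=1}^n g(z_t)^2 \le \gamma^2$. Then $v[g] = 0$ is an element $\gamma$-close to the values of $g$ on the sample, and (30) is trivially satisfied. Next, consider the case $\frac{1}{n}\sum_{t=1}^n g(z_t)^2 > \gamma^2$ and write $u = (g(z_1), \ldots, g(z_n))$. The triangle inequality for the Euclidean norm, normalized as $\|u\| = \big(\frac{1}{n}\sum_{t=1}^n u_t^2\big)^{1/2}$, yields

$$
\|v[g]\| \le \|v[g] - u\| + \|u\| \le \gamma + \|u\| \le 2\|u\|,
$$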

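establishing non-positivity of the second term in (28). The third term in (28) is upper bounded with the help of Lemma 5 as

$$
\mathbb{E}_\epsilon \max_{g \in \mathcal{G}} \left[ \frac{1}{n} \sum_{t=1}^n \epsilon_t v[g]_t - (C/4)\, v[g]_t^2 \right]
\le \frac{2}{C} \cdot \frac{\log \mathcal{N}_2(\mathcal{G}, \gamma)}{n}.
$$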


Finally, the first term in (28) is upper bounded using the standard chaining technique, keeping in mind that the $\ell_2$-diameter of the indexing set is at most $\gamma$. □
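The passage from the norm inequality to the nonpositivity of the second term in (28) can be spelled out as below. This is a sketch under the assumption that $\|\cdot\|$ denotes the normalized Euclidean norm, $\|u\|^2 = \frac{1}{n}\sum_{t} u_t^2$; the same algebra goes through without the $1/n$ factor.

```latex
% Sketch, assuming ||u||^2 = (1/n) \sum_t u_t^2 with u = (g(z_1), ..., g(z_n)).
\begin{align*}
\|v[g]\| \le 2\|u\|
  \;&\Longrightarrow\; \frac{1}{4}\sum_{t=1}^n v[g]_t^2 \le \sum_{t=1}^n g(z_t)^2, \\
\frac{1}{n}\sum_{t=1}^n \Big( \frac{C}{4}\, v[g]_t^2 - C\, g(z_t)^2 \Big)
  &= C \Big( \frac{1}{4}\|v[g]\|^2 - \|u\|^2 \Big) \le 0 .
\end{align*}
% In the first case of the proof, v[g] = 0 and the left-hand side equals -C ||u||^2 <= 0.
```

Since the inequality holds for every $g$, the supremum over $g$ of the second term is nonpositive as well, which is what allows that term to be dropped.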
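Proof of Lemma 7. The proof is similar to the proof of Lemma 6. We proceed with the following decomposition:

$$
\sup_{g \in \mathcal{G}} \left[ \frac{1}{n} \sum_{t=1}^n \epsilon_t g(z_t) - C g(z_t)^2 \right]
\le \sup_{g \in \mathcal{G}} \left[ \frac{1}{n} \sum_{t=1}^n \epsilon_t \big(g(z_t) - v[g]_t\big) \right]
+ \sup_{g \in \mathcal{G}} \left[ \frac{1}{n} \sum_{t=1}^n \epsilon_t v[g]_t - (C/4)\, v[g]_t^2 \right].
$$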


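For the first term, we can employ the traditional high probability chaining bound. For some $c > 0$, the following holds:

$$
P_\epsilon\left( \sup_{g \in \mathcal{G}} \left[ \frac{1}{n} \sum_{t=1}^n \epsilon_t \big(g(z_t) - v[g]_t\big) \right]
> u \cdot \inf_{\alpha \in [0,\gamma]} \left\{ 4\alpha + \frac{12}{\sqrt{n}} \int_\alpha^\gamma \sqrt{\log \mathcal{N}_2(\mathcal{G}, \delta)}\, d\delta \right\} \right)
\le \frac{2}{1 - e^{-2}} \exp(-cu^2).
$$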


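For the second term,

$$
P_\epsilon\left( \sup_{g \in \mathcal{G}} \left[ \frac{1}{n} \sum_{t=1}^n \epsilon_t v[g]_t - (C/4)\, v[g]_t^2 \right]
> \frac{2}{C} \cdot \frac{\log \mathcal{N}_2(\mathcal{G}, \gamma) + u}{n} \right) \le \exp(-u).
$$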


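Combining the above two bounds, we have

$$
\begin{aligned}
& P_\epsilon\left( \sup_{g \in \mathcal{G}} \left[ \frac{1}{n} \sum_{t=1}^n \epsilon_t g(z_t) - C g(z_t)^2 \right]
> u \cdot \inf_{\alpha \in [0,\gamma]} \left\{ 4\alpha + \frac{12}{\sqrt{n}} \int_\alpha^\gamma \sqrt{\log \mathcal{N}_2(\mathcal{G}, \delta)}\, d\delta \right\}
+ \frac{2}{C} \cdot \frac{\log \mathcal{N}_2(\mathcal{G}, \gamma) + u}{n} \right) \\
&\quad \le P_\epsilon\left( \sup_{g \in \mathcal{G}} \left[ \frac{1}{n} \sum_{t=1}^n \epsilon_t \big(g(z_t) - v[g]_t\big) \right]
> u \cdot \inf_{\alpha \in [0,\gamma]} \left\{ 4\alpha + \frac{12}{\sqrt{n}} \int_\alpha^\gamma \sqrt{\log \mathcal{N}_2(\mathcal{G}, \delta)}\, d\delta \right\} \right) \\
&\qquad + P_\epsilon\left( \sup_{g \in \mathcal{G}} \left[ \frac{1}{n} \sum_{t=1}^n \epsilon_t v[g]_t - (C/4)\, v[g]_t^2 \right]
> \frac{2}{C} \cdot \frac{\log \mathcal{N}_2(\mathcal{G}, \gamma) + u}{n} \right) \\
&\quad \le \frac{2}{1 - e^{-2}} \exp(-cu^2) + \exp(-u).
\end{aligned}
$$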


□ 


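Proof of Theorem 9. Denote by $\mathcal{B}$ the unit ball with respect to the $\ell_2$ distance, $\mathcal{B} = \{h : \mathbb{E}h^2 \le 1\}$, and let $\mathcal{S}$ denote the unit sphere. Choosing any $h \in \mathcal{H} \setminus r\mathcal{B}$, we have $\|h\|_{\ell_2} > r = \alpha_n(\mathcal{H}, \kappa', \delta)$ with $\kappa'$ to be chosen later. Under the assumption that $\mathcal{H}$ is star-shaped, we know $h_r := \frac{r}{\|h\|_{\ell_2}} h \in \mathcal{H} \cap r\mathcal{S}$, thus

$$
\begin{aligned}
\frac{2}{n} \sum_{i=1}^n \epsilon_i \xi_i h(X_i) - c' \frac{1}{n} \sum_{i=1}^n h^2(X_i)
&= \frac{\|h\|_{\ell_2}}{r} \cdot \frac{2}{n} \sum_{i=1}^n \epsilon_i \xi_i h_r(X_i) - \frac{\|h\|_{\ell_2}^2}{r^2}\, c' \frac{1}{n} \sum_{i=1}^n h_r^2(X_i) \\
&= \frac{\|h\|_{\ell_2}}{r} \left\{ \frac{2}{n} \sum_{i=1}^n \epsilon_i \xi_i h_r(X_i) - c' \frac{1}{n} \sum_{i=1}^n h_r^2(X_i) \right\}
- \frac{\|h\|_{\ell_2}}{r} \left( \frac{\|h\|_{\ell_2}}{r} - 1 \right) c' \frac{1}{n} \sum_{i=1}^n h_r^2(X_i).
\end{aligned}
$$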
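The rescaling identity above is purely algebraic, so it can be checked numerically on arbitrary sample values. The following is a minimal sketch: the helper names and the random inputs are hypothetical, and `norm_h` simply stands in for $\|h\|_{\ell_2}$ (any value larger than `r` works, since the identity does not depend on how the norm is computed).

```python
import math
import random


def offset_value(eps, xi, h, c_prime):
    """(2/n) * sum eps_i*xi_i*h_i - c' * (1/n) * sum h_i**2 for sample values h_i."""
    n = len(h)
    linear = 2.0 / n * sum(e * x * v for e, x, v in zip(eps, xi, h))
    quadratic = c_prime / n * sum(v * v for v in h)
    return linear - quadratic


def rescaled_decomposition(eps, xi, h, c_prime, r, norm_h):
    """Right-hand side of the identity, written in terms of h_r = (r / norm_h) * h."""
    n = len(h)
    ratio = norm_h / r
    h_r = [v / ratio for v in h]
    inside = offset_value(eps, xi, h_r, c_prime)
    penalty = ratio * (ratio - 1.0) * c_prime / n * sum(v * v for v in h_r)
    return ratio * inside - penalty


# Hypothetical inputs: random signs, Gaussian noise, arbitrary function values.
rng = random.Random(1)
n = 20
eps = [rng.choice((-1.0, 1.0)) for _ in range(n)]
xi = [rng.gauss(0.0, 1.0) for _ in range(n)]
h = [rng.uniform(-3.0, 3.0) for _ in range(n)]
c_prime, r, norm_h = 0.3, 0.5, 2.0  # norm_h plays the role of ||h|| > r

lhs = offset_value(eps, xi, h, c_prime)
rhs = rescaled_decomposition(eps, xi, h, c_prime, r, norm_h)
print(math.isclose(lhs, rhs, rel_tol=1e-9, abs_tol=1e-9))
```

The extra penalty term carries the factor $\frac{\|h\|}{r}\big(\frac{\|h\|}{r} - 1\big)$, which is positive outside the ball $r\mathcal{B}$; this is the quantity the remainder of the proof trades against the supremum inside the ball.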

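Comparing the supremum of the offset Rademacher process outside the ball $r\mathcal{B}$ with the one inside the ball $r\mathcal{B}$, we have

\begin{align}
& \sup_{h \in \mathcal{H} \setminus r\mathcal{B}} \left\{ \frac{2}{n} \sum_{i=1}^n \epsilon_i \xi_i h(X_i) - c' \frac{1}{n} \sum_{i=1}^n h^2(X_i) \right\}
- \sup_{h \in \mathcal{H} \cap r\mathcal{B}} \left\{ \frac{2}{n} \sum_{i=1}^n \epsilon_i \xi_i h(X_i) - c' \frac{1}{n} \sum_{i=1}^n h^2(X_i) \right\} \notag \\
&\quad \le \sup_{h \in \mathcal{H} \setminus r\mathcal{B}} \left\{ \frac{\|h\|_{\ell_2}}{r} \sup_{h_r \in \mathcal{H} \cap r\mathcal{S}} \left\{ \frac{2}{n} \sum_{i=1}^n \epsilon_i \xi_i h_r(X_i) - c' \frac{1}{n} \sum_{i=1}^n h_r^2(X_i) \right\}
- \frac{\|h\|_{\ell_2}}{r} \left( \frac{\|h\|_{\ell_2}}{r} - 1 \right) \inf_{h_r \in \mathcal{H} \cap r\mathcal{S}} \left\{ c' \frac{1}{n} \sum_{i=1}^n h_r^2(X_i) \right\} \right\} \notag \\
&\qquad - \sup_{h \in \mathcal{H} \cap r\mathcal{B}} \left\{ \frac{2}{n} \sum_{i=1}^n \epsilon_i \xi_i h(X_i) - c' \frac{1}{n} \sum_{i=1}^n h^2(X_i) \right\} \notag \\
&\quad \le \sup_{h \in \mathcal{H} \setminus r\mathcal{B}} \left( \frac{\|h\|_{\ell_2}}{r} - 1 \right) \left\{ \sup_{h_r \in \mathcal{H} \cap r\mathcal{B}} \left\{ \frac{2}{n} \sum_{i=1}^n \epsilon_i \xi_i h_r(X_i) - c' \frac{1}{n} \sum_{i=1}^n h_r^2(X_i) \right\}
- \inf_{h_r \in \mathcal{H} \cap r\mathcal{S}} \left\{ c' \frac{1}{n} \sum_{i=1}^n h_r^2(X_i) \right\} \right\}. \tag{31}
\end{align}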
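If

$$
\kappa' r^2 \le c'(1-\epsilon) r^2,
$$

we can apply the lower isometry bound 1 and conclude

$$
\sup_{h \in \mathcal{H} \cap r\mathcal{B}} \left\{ \frac{2}{n} \sum_{i=1}^n \epsilon_i \xi_i h(X_i) - c' \frac{1}{n} \sum_{i=1}^n h^2(X_i) \right\}
\le \kappa' r^2 \le c'(1-\epsilon) r^2
\le \inf_{h_r \in \mathcal{H} \cap r\mathcal{S}} \left\{ c' \frac{1}{n} \sum_{i=1}^n h_r^2(X_i) \right\}
$$

with probability at least $1 - 2\delta$.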

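Under this event, the difference of terms in (31) is smaller than 0, and we conclude

$$
\begin{aligned}
& \sup_{h \in \mathcal{H} \setminus r\mathcal{B}} \left\{ \frac{2}{n} \sum_{i=1}^n \epsilon_i \xi_i h(X_i) - c' \frac{1}{n} \sum_{i=1}^n h^2(X_i) \right\}
- \sup_{h \in \mathcal{H} \cap r\mathcal{B}} \left\{ \frac{2}{n} \sum_{i=1}^n \epsilon_i \xi_i h(X_i) - c' \frac{1}{n} \sum_{i=1}^n h^2(X_i) \right\} \\
&\quad \le \sup_{h \in \mathcal{H} \setminus r\mathcal{B}} \left( \frac{\|h\|_{\ell_2}}{r} - 1 \right) \left\{ \sup_{h_r \in \mathcal{H} \cap r\mathcal{B}} \left\{ \frac{2}{n} \sum_{i=1}^n \epsilon_i \xi_i h_r(X_i) - c' \frac{1}{n} \sum_{i=1}^n h_r^2(X_i) \right\}
- \inf_{h_r \in \mathcal{H} \cap r\mathcal{S}} \left\{ c' \frac{1}{n} \sum_{i=1}^n h_r^2(X_i) \right\} \right\} \\
&\quad \le \sup_{h \in \mathcal{H} \setminus r\mathcal{B}} \left( \frac{\|h\|_{\ell_2}}{r} - 1 \right) \left( \kappa' r^2 - c'(1-\epsilon) r^2 \right) \le 0.
\end{aligned}
$$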

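Thus the excess loss is upper bounded by the offset Rademacher process, and the latter is further bounded by the process restricted within the critical radius:

\begin{align}
\sup_{h \in \mathcal{H}} \left\{ \frac{2}{n} \sum_{i=1}^n \epsilon_i \xi_i h(X_i) - c' \frac{1}{n} \sum_{i=1}^n h^2(X_i) \right\}
&\le \sup_{h \in \mathcal{H} \cap r\mathcal{B}} \left\{ \frac{2}{n} \sum_{i=1}^n \epsilon_i \xi_i h(X_i) - c' \frac{1}{n} \sum_{i=1}^n h^2(X_i) \right\} \tag{32} \\
&\le c'(1-\epsilon)\, \alpha_n^2\big(\mathcal{H}, c'(1-\epsilon), \delta\big) \tag{33}
\end{align}

with probability at least $1 - 2\delta$.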

□ 


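Proof of Theorem 13. Denote $\mathcal{G} = \mathcal{F} + \mathrm{star}(\mathcal{F} - \mathcal{F})$. The minimax excess loss can be written as

$$
\begin{aligned}
& \inf_{\hat{g} \in \mathcal{G}} \sup_{P} \left\{ \mathbb{E}(\hat{g} - Y)^2 - \inf_{f \in \mathcal{F}} \mathbb{E}(f - Y)^2 \right\} \\
&\quad = \inf_{\hat{g} \in \mathcal{G}} \sup_{P} \left\{ \left\{ -\mathbb{E}2Y\hat{g} + \mathbb{E}\hat{g}^2 \right\} + \sup_{f \in \mathcal{F}} \left\{ \mathbb{E}2Yf - \mathbb{E}f^2 \right\} \right\}.
\end{aligned}
$$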

Now let us construct a particular distribution $P$ in the following way: take any $x_1, x_2, \ldots, x_{(1+c)n} \in \mathcal{X}$ and let $P_X$ be the uniform distribution on these $(1+c)n$ points. For any $\epsilon = (\epsilon_1, \ldots, \epsilon_{(1+c)n}) \in \{\pm 1\}^{(1+c)n}$, denote the distribution $P_\epsilon$ of $(X, Y)$ indexed by $\epsilon$ to be: $X$ is sampled from $P_X$, and $Y|_{X = x_i} = \epsilon_i$ for all $1 \le i \le (1+c)n$. Note here $\hat{g} : (X, Y)^{\otimes n} \to \mathcal{G} = \mathcal{F} + \mathrm{star}(\mathcal{F} - \mathcal{F})$. Now we proceed with this particular distribution:


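$$
\begin{aligned}
& \inf_{\hat{g} \in \mathcal{G}} \sup_{P} \left\{ \left\{ -\mathbb{E}2Y\hat{g} + \mathbb{E}\hat{g}^2 \right\} + \sup_{f \in \mathcal{F}} \left\{ \mathbb{E}2Yf - \mathbb{E}f^2 \right\} \right\} \\
&\quad \ge \inf_{\hat{g} \in \mathcal{G}} \sup_{\{x_i\}} \mathbb{E}_\epsilon \left\{ \left\{ -\mathbb{E}2Y\hat{g} + \mathbb{E}\hat{g}^2 \right\} + \sup_{f \in \mathcal{F}} \left\{ \mathbb{E}2Yf - \mathbb{E}f^2 \right\} \right\} \\
&\quad \ge \sup_{\{x_i\}} \mathbb{E}_\epsilon \sup_{f \in \mathcal{F}} \frac{1}{(1+c)n} \sum_{i=1}^{(1+c)n} \left\{ 2\epsilon_i f(x_i) - f(x_i)^2 \right\}
- \sup_{\hat{g} \in \mathcal{G}} \sup_{\{x_i\}} \mathbb{E}_\epsilon \left\{ 2\mathbb{E}Y\hat{g} - \mathbb{E}\hat{g}^2 \right\}.
\end{aligned}
$$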
Note that the first term is exactly $\mathfrak{R}^o((1+c)n, \mathcal{F})$. Let us upper bound the second term. Denote the indices of $n$ uniform samples drawn with replacement from the $(1+c)n$ points as $i_1, i_2, \ldots, i_n$, and let $I$ be the set of unique indices, $|I| \le n$. Observe that $\hat{g}$ is a function of $(x_I, Y_I)$ only, independent of $\epsilon_j$, $j \notin I$.

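\begin{align}
\sup_{\hat{g} \in \mathcal{G}} \sup_{\{x_i\}} \mathbb{E}_\epsilon \left\{ 2\mathbb{E}Y\hat{g} - \mathbb{E}\hat{g}^2 \right\}
&\le \sup_{\hat{g} \in \mathcal{G}} \sup_{\{x_i\}} \mathbb{E}_\epsilon \mathbb{E}_{i_1,\ldots,i_n} \frac{1}{(1+c)n} \sum_{i=1}^{(1+c)n} \left\{ 2\epsilon_i \hat{g}(x_i) - \hat{g}(x_i)^2 \right\} \notag \\
&= \sup_{\hat{g} \in \mathcal{G}} \sup_{\{x_i\}} \mathbb{E}_{i_1,\ldots,i_n} \mathbb{E}_\epsilon \frac{1}{(1+c)n} \sum_{i=1}^{(1+c)n} \left\{ 2\epsilon_i \hat{g}(x_i) - \hat{g}(x_i)^2 \right\}. \tag{34}
\end{align}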

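Conditionally on $i_1, i_2, \ldots, i_n$,

$$
\mathbb{E}_\epsilon \frac{1}{(1+c)n} \sum_{i \notin I} \left\{ 2\epsilon_i \hat{g}(x_i) - \hat{g}(x_i)^2 \right\}
= 0 - \mathbb{E}_\epsilon \frac{1}{(1+c)n} \sum_{i \notin I} \hat{g}(x_i)^2.
$$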


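The expression in (34) is therefore upper bounded by

$$
\begin{aligned}
& \sup_{\hat{g} \in \mathcal{G}} \sup_{\{x_i\}} \mathbb{E}_{i_1,\ldots,i_n} \mathbb{E}_\epsilon \frac{1}{(1+c)n} \sum_{i \in I} \left\{ 2\epsilon_i \hat{g}(x_i) - \hat{g}(x_i)^2 \right\} \\
&\quad \le \sup_{\{x_i\}} \mathbb{E}_{i_1,\ldots,i_n} \sup_{\hat{g} \in \mathcal{G}} \mathbb{E}_\epsilon \frac{1}{(1+c)n} \sum_{i \in I} \left\{ 2\epsilon_i \hat{g}(x_i) - \hat{g}(x_i)^2 \right\} \\
&\quad \le \sup_{\hat{g} \in \mathcal{G}} \sup_{\{x_i\}_{i=1}^n \in \mathcal{X}^{\otimes n}} \mathbb{E}_\epsilon \frac{1}{(1+c)n} \sum_{i=1}^n \left\{ 2\epsilon_i \hat{g}(x_i) - \hat{g}(x_i)^2 \right\} \\
&\quad \le \sup_{\{x_i\}_{i=1}^n \in \mathcal{X}^{\otimes n}} \mathbb{E}_\epsilon \sup_{g \in \mathcal{G}} \frac{1}{(1+c)n} \sum_{i=1}^n \left\{ 2\epsilon_i g(x_i) - g(x_i)^2 \right\} \\
&\quad = \frac{1}{1+c} \sup_{\{x_i\}_{i=1}^n \in \mathcal{X}^{\otimes n}} \mathbb{E}_\epsilon \sup_{g \in \mathcal{G}} \frac{1}{n} \sum_{i=1}^n \left\{ 2\epsilon_i g(x_i) - g(x_i)^2 \right\}.
\end{aligned}
$$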

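Thus the claim holds. □


Proof of Lemma 11. From Lemma 8, we know for $\mathcal{H} = \mathcal{F} - f^* + \mathrm{star}(\mathcal{F} - \mathcal{F})$,

$$
\log \mathcal{N}_2(\mathcal{H}, 8\epsilon) \le \log \mathcal{N}_2(\mathcal{F} - f^*, 4\epsilon) + \log \mathcal{N}_2(\mathrm{star}(\mathcal{F} - \mathcal{F}), 4\epsilon) \le \log \frac{2}{\epsilon} + 3 \log \mathcal{N}_2(\mathcal{F}, \epsilon).
$$

Consider the $\delta$-covering net of $\mathcal{H}$, where for any $h \in \mathcal{H}$, $v[h]$ is the closest point on the net.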


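$$
\begin{aligned}
& \frac{1}{n} \sup_{h \in \mathcal{H}} \left\{ \sum_{i=1}^n 2\epsilon_i \xi_i h(X_i) - C h(X_i)^2 \right\} \\
&\quad \le \frac{1}{n} \sup_{h \in \mathcal{H}} \left\{ \sum_{i=1}^n 2\epsilon_i \xi_i \big(h(X_i) - v[h]_i\big) - C \big[h(X_i)^2 - v[h]_i^2\big] \right\}
+ \frac{1}{n} \sup_{v \in \mathcal{N}_2(\mathcal{H}, \delta)} \left\{ \sum_{i=1}^n 2\epsilon_i \xi_i v_i - C v_i^2 \right\} \\
&\quad \le 2 \left( \sqrt{\frac{1}{n} \sum_{i=1}^n \xi_i^2} + 2C \right) \cdot \delta
+ \frac{1}{n} \sup_{v \in \mathcal{N}_2(\mathcal{H}, \delta)} \left\{ \sum_{i=1}^n 2\epsilon_i \xi_i v_i - C v_i^2 \right\}.
\end{aligned}
$$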
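The second term is the offset Rademacher process over a finite set whose log-cardinality is at most $\log(16/\delta) + 3\log N$; thus, applying Lemma 5,

$$
\mathbb{E}_\epsilon \frac{1}{n} \sup_{v \in \mathcal{N}_2(\mathcal{H}, \delta)} \left\{ \sum_{i=1}^n 2\epsilon_i \xi_i v_i - C v_i^2 \right\}
\le K \cdot \frac{\log(N \vee n)}{n},
$$

where $K$ is defined in Equation (6). We also have the high probability bound via Lemma 5:

$$
P_\epsilon\left( \frac{1}{n} \sup_{v \in \mathcal{N}_2(\mathcal{H}, \delta)} \left\{ \sum_{i=1}^n 2\epsilon_i \xi_i v_i - C v_i^2 \right\}
> K \cdot \frac{\log(N \vee n) + u}{n} \right) \le e^{-u}.
$$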


□ 